{"id":5515,"date":"2017-10-30T01:50:37","date_gmt":"2017-10-30T01:50:37","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=5515"},"modified":"2017-10-30T01:50:37","modified_gmt":"2017-10-30T01:50:37","slug":"concept-search","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/concept-search\/","title":{"rendered":"Concept Search"},"content":{"rendered":"<p>Concept search allows you to query documents for words with a meaning similar to your search terms. Let&#8217;s look at a couple of examples:<\/p>\n<pre>Writing NOT Code\n<\/pre>\n<p>This query implies that we should exclude or de-rank documents with phrases like \u201cwriting css\u201d or \u201cwriting php\u201d, preferring results with &#8220;poetry&#8221;, &#8220;fiction&#8221;, or \u201ccopyediting.\u201d <\/p>\n<p>This scenario is a ticket I received from a user of <a href=\"https:\/\/www.findlectures.com\/\">www.findlectures.com<\/a>. People want ways to research the latest developments in their field, without search results being cluttered with talks that mention their interests in passing.<\/p>\n<p>For a more complex example, let&#8217;s say we were searching recipes:<\/p>\n<pre>Vegetarian Food NOT Dairy \n<\/pre>\n<p>&#8220;Vegan cooking&#8221; would be an appropriate result for this &#8211; as would be recipes for people with food allergies.<\/p>\n<p>The negative search for dairy implies a hierarchy of items we wish to remove: milk, cheese and any specific types, components, or brands of these (e.g. Parmesan, whey, Kraft, respectively). It&#8217;s unlikely that the searcher has thought this far into what they want, but they implicitly expect it.<\/p>\n<p>For a final example:<\/p>\n<pre>\nPhiladelphia, History\n<\/pre>\n<p>This search implies a geography and time range. 
<a href=\"http:\/\/reason.com\/archives\/2017\/10\/14\/delawares-odd-beautiful-conten\">This article about the founding of Arden, Delaware<\/a>, would be an appropriate match because it is nearby.<\/p>\n<h2>WordNet<\/h2>\n<p>The notion of a concept hierarchy brings the WordNet<sup><a href=\"#footnote_0_5515\" id=\"identifier_0_5515\" class=\"footnote-link footnote-identifier-link\" title=\"https:\/\/wordnet.princeton.edu\/\">1<\/a><\/sup> database to mind, although it is not sufficiently detailed for our purposes. Let&#8217;s suppose you wanted to search for talks, excluding politics and religion. If you start at &#8220;religion&#8221; and work down the tree, you get about 200 results:<\/p>\n<pre>'religion', 'faith', 'organized religion', 'taoism', 'shinto', 'sect', 'religious sect',\n'religious order', 'zurvanism', 'waldenses', 'vaudois', 'vaishnavism',  'vaisnavism', 'sunni',\n'sunni islam', 'sisterhood', 'shuha shinto', 'shua', 'shivaism', 'sivaism', 'shiah', 'shia',\n'shiah islam', 'shaktism', 'saktism', 'shakers', \n'united society of believers in christ\\'s second appearing', 'religious society of friends', \n'society of friends', 'quakers', 'order', 'monastic order', 'society of jesus', 'jesuit order', \n'franciscan order', 'dominican order', 'carthusian order', 'carmelite order', \n'order of our lady of mount carmel', 'benedictine order', 'order of saint benedict', \n'augustinian order', 'austin friars', 'augustinian hermits', 'augustinian canons', \n'kokka shinto', 'kokka', 'karaites', 'jainism', 'high church', 'high anglican church',  \n'haredi', 'hare krishna', 'international society for krishna consciousness', 'iskcon', \n'brethren', 'amish sect', 'albigenses', 'cathars', 'cathari', 'abecedarian', 'scientology', \n'church of scientology', 'khalsa', 'judaism', 'hebraism', 'jewish religion', 'reform judaism', \n'orthodox judaism', 'jewish orthodoxy', 'hasidim', 'hassidim',  'hasidism', 'chasidim', \n'chassidim', 'conservative judaism', 'hinduism', 
'hindooism', 'brahmanism', 'brahminism', \n'established church', 'cult', 'cargo cult', 'wicca',  'voodoo', 'rastafarian', 'rastafari', \n'rastas', 'obeah', 'obi', 'macumba', 'church', 'christian church', 'unification church', \n'protestant church', 'protestant', 'pentecostal religion', 'nestorian church', 'coptic church', \n'catholic church', ...142 more]\n<\/pre>\n<p>This list is pretty good, but far too short &#8211; Wikipedia&#8217;s list of Christian sects includes nearly 200 entries just under &#8220;types of Baptist&#8221;<sup><a href=\"#footnote_1_5515\" id=\"identifier_1_5515\" class=\"footnote-link footnote-identifier-link\" title=\"https:\/\/en.wikipedia.org\/wiki\/List_of_Christian_denominations\">2<\/a><\/sup>. <\/p>\n<p>The tree of concepts for politics has almost nothing. Of the results it returns, none are what people mean by &#8220;no politics&#8221;. Whether a talk is political depends on time and place &#8211; whatever is in the news now, where I am, is political today and history tomorrow.<\/p>\n<pre>\n'political sympathies', 'political science', 'government', 'realpolitik', 'practical politics', \n'geopolitics', 'geostrategy', 'political relation'\n<\/pre>\n<h2>Word2Vec<\/h2>\n<p>Machine learning techniques are a compelling alternative to using a database maintained by a team, because you can rely on a computer to find patterns, and update your model as new text becomes available. Word embeddings<sup><a href=\"#footnote_2_5515\" id=\"identifier_2_5515\" class=\"footnote-link footnote-identifier-link\" title=\"https:\/\/en.wikipedia.org\/wiki\/Word_embedding\">3<\/a><\/sup> are one such technique: Word2vec can discover implicit relationships, such as gender or country capitals. 
Variations of Word2vec represent word meanings as mathematical vectors, derived from how the words are used in context.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-medium wp-image-5440\" src=\"http:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/image_4-300x140.png\" alt=\"image_4\" width=\"300\" height=\"140\" srcset=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/image_4-300x140.png 300w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/image_4-1024x479.png 1024w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/image_4-768x359.png 768w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/image_4-1536x718.png 1536w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/image_4-1200x561.png 1200w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/image_4.png 1619w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p>Much of the practical research I&#8217;ve found on Word2Vec and search uses it to generate synonyms. There are a few papers suggesting that Word2Vec can discover other types of relationships<sup><a href=\"#footnote_3_5515\" id=\"identifier_3_5515\" class=\"footnote-link footnote-identifier-link\" title=\"http:\/\/scholarworks.uno.edu\/cgi\/viewcontent.cgi?article=3003&amp;context=td\">4<\/a><\/sup> (e.g. more general\/specific terms). In a Word2vec model trained on English, the &#8220;nearest&#8221; term to a noun is often the plural of the term, since the two forms appear in similar contexts. 
I expect this would be different in a language with cases (German, Greek, Latin), as related verbs and adjectives change their endings to match.<\/p>\n<h2>Use Cases<\/h2>\n<p>I&#8217;m using concept search to generate <a href=\"https:\/\/www.findlectures.com\/form?type=alert\">personalized email digests<\/a> resembling the excellent Cooper Press newsletters (<a href=\"http:\/\/javascriptweekly.com\/\">JavaScript Weekly<\/a>, <a href=\"https:\/\/postgresweekly.com\/\">Postgres Weekly<\/a>, etc.). Each email has unique articles from Reddit and conference talks from <a href=\"https:\/\/www.findlectures.com\">FindLectures.com<\/a>, chosen with machine learning.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5524\" src=\"http:\/\/172.104.26.128\/wp-content\/uploads\/2017\/10\/email2-1.png\" alt=\"email2\" width=\"358\" height=\"584\" srcset=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/email2-1.png 358w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/email2-1-184x300.png 184w\" sizes=\"(max-width: 358px) 100vw, 358px\" \/><\/p>\n<p>Concept search shines for users who enter multiple search terms. For &#8220;python&#8221; and &#8220;machine learning&#8221;, we really want to see pieces about scikit-learn, TensorFlow, and Keras. 
If we enter &#8220;java&#8221; and &#8220;machine learning&#8221;, we instead expect to see work by people using Stanford NLP or Deeplearning4j.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5525\" src=\"http:\/\/172.104.26.128\/wp-content\/uploads\/2017\/10\/email1-1.png\" alt=\"email1\" width=\"383\" height=\"605\" srcset=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/email1-1.png 383w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/email1-1-190x300.png 190w\" sizes=\"(max-width: 383px) 100vw, 383px\" \/><\/p>\n<h2>Rocchio Algorithm<\/h2>\n<p>Traditional full text search tools (Lucene) query for the presence or absence of terms, weighted by how often they occur. A variation of this is the simple and fast Rocchio Algorithm<sup><a href=\"#footnote_4_5515\" id=\"identifier_4_5515\" class=\"footnote-link footnote-identifier-link\" title=\"https:\/\/en.wikipedia.org\/wiki\/Rocchio_algorithm\">5<\/a><\/sup>. The Rocchio Algorithm essentially does the following:<\/p>\n<ol>\n<li>Run a Search<\/li>\n<li>Get common terms<\/li>\n<li>Run search again, using the terms you should have used in step 1<\/li>\n<\/ol>\n<p>This improves the quality of results, and it&#8217;s very fast. 
There is an excellent talk on this<sup><a href=\"#footnote_5_5515\" id=\"identifier_5_5515\" class=\"footnote-link footnote-identifier-link\" title=\"https:\/\/www.youtube.com\/watch?v=-uiQY2Zatjo&amp;index=31&amp;list=PLU6n9Voqu_1FMt0C-tVNFK0PBqWhTb2Nv\">6<\/a><\/sup> by Simon Hughes of Dice.com, who has a Solr plugin that implements the algorithm<sup><a href=\"#footnote_6_5515\" id=\"identifier_6_5515\" class=\"footnote-link footnote-identifier-link\" title=\"https:\/\/github.com\/DiceTechJobs\/RelevancyFeedback\">7<\/a><\/sup>.<\/p>\n<p>Here are the results when we query for articles submitted to Reddit on Python + Machine Learning:<\/p>\n<pre lang=\"python\">Using the scikit-learn machine learning library in Ruby using PyCall\nPythons Positive Press Pumps Pandas\nImage Recognition with Python, Clarifai and Twilio\nDeepSchool.io Open Source Deep Learning course\nKeep it simple! How to understand Gradient Descent algorithm\nML-From-Scratch: Library of bare bones Python implementations of Machine Learning models and algorithms\nEpoch vs Batch Size vs Iterations: Machine Learning\n<\/pre>\n<p>The Rocchio algorithm does a good job here, but I suspect that Word2Vec can do better because it maintains a concept of how similar two terms are.<\/p>\n<p>Understanding how Word2Vec defines similarity is foundational to the work we want to do with concepts: the similarity between two term vectors is the cosine of the angle between them. This produces a score from -1 to 1; for related terms it usually falls between 0 and 1. 
Like full text scores, higher is better, and it&#8217;s not mathematically valid to add the scores.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-medium wp-image-5520\" src=\"http:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/cosine-300x187.png\" alt=\"cosine\" width=\"300\" height=\"187\" srcset=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/cosine-300x187.png 300w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/cosine-1024x638.png 1024w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/cosine-768x479.png 768w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/cosine-1536x957.png 1536w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/cosine-1200x748.png 1200w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/cosine.png 1590w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<h2>Dataset<\/h2>\n<p>The dataset for <a href=\"https:\/\/www.findlectures.com\">FindLectures.com<\/a> includes machine transcriptions for most talks. When I ran Word2vec on this dataset it identified the nearest word to &#8220;code&#8221; as &#8220;coat&#8221; &#8211; a logical mistake for a machine. This suggests that Word2vec trained on articles could be used to improve machine transcription by incorporating the probability of a term in real usage. This paper<br \/>\n<sup><a href=\"#footnote_7_5515\" id=\"identifier_7_5515\" class=\"footnote-link footnote-identifier-link\" title=\"https:\/\/www.researchgate.net\/publication\/224319730_Improving_language_models_by_using_distant_information\">8<\/a><\/sup> shows a promising example, using French language newspaper text to improve transcription of broadcasts.<\/p>\n<h2>Synonyms<\/h2>\n<p>Let&#8217;s consider a simple change to the Rocchio algorithm: use synonyms suggested by Word2vec, but incorporate the distance from the source query to the synonym as a weighting factor. 
<\/p>\n<pre lang=\"scala\">\nList(\"python\", \"machine\", \"learning\").map(\n  (queryTerm) =>\n    \"(\" +\n      \/\/ the 25 terms closest to this query term\n      model.wordsNearest(\n        List(queryTerm), \/\/ positive terms\n        List(), \/\/ negative terms\n        25\n      ).map(\n        (nearWord) =>\n          \"transcript:\" + nearWord +\n          \"^\" + \n            \/\/ boost grows as the angle to the query term shrinks\n            (1 + \n              (Math.PI - \n               Math.acos(\n                 model.similarity(queryTerm, nearWord))))\n        ).mkString(\" OR \") \n     + \")\"\n).mkString(\" AND \")\n<\/pre>\n<p>This code uses the angle between terms as a boost. The following text shows what the resulting Solr query looks like; the caret (^) marks a boost. These boosts are multiplied by the weights Solr maintains internally (BM25)<sup><a href=\"#footnote_8_5515\" id=\"identifier_8_5515\" class=\"footnote-link footnote-identifier-link\" title=\"http:\/\/opensourceconnections.com\/blog\/2015\/10\/16\/bm25-the-next-generation-of-lucene-relevation\/\">9<\/a><\/sup>. This accounts for how often a term occurs &#8211; really common synonyms will have essentially no effect on the output.<\/p>\n<pre>\ntitle_s:python^10 OR title_s:\"machine learning\"^10 \u2026\n(title_s: software^1.21 OR title_s:database^1.20 OR title_s:format^1.18 \ntitle_s:applications^1.14 OR title_s:browser^1.14 OR title_s:setup^1.13 \ntitle_s:bootstrap^1.13 OR title_s:in-class^1.13 OR title_s:campesina^1.12 OR \ntitle_s:excel^1.12 OR title_s:hardware^1.11 OR title_s:programming^1.11 OR\ntitle_s:api^1.11 OR title_s:prototype^1.11 OR title_s:middleware^1.11 OR \ntitle_s:openstreetmap^1.10 OR title_s:product^1.10 OR title_s:app^1.09 OR \ntitle_s:hbp^1.09 OR title_s:programmers^1.09 OR title_s:application^1.09 OR \ntitle_s:databases^1.09 OR title_s:idiomatic^1.09 OR title_s:spreadsheet^1.09 \nOR title_s:java^1.09 \u2026\nAND (\u2026)\n<\/pre>\n<p>Here are the results for Python + Machine Learning:<\/p>\n<pre>\nPython for Data Analysis\nThe \/r\/playrust Classifier: Real World Rust Data 
Science\nAndreas Mueller - Commodity Machine Learning\nJose Quesada - A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons\nYOW! 2016 Mark Hibberd & Ben Lever - Lab to Factory: Robust Machine Learning Systems\nHow To Get Started With Machine Learning? | Two Minute Papers\nA Gentle Introduction To Machine Learning\nBurc Arpat - Why Python is Awesome When Working With Data at any Scale\nLeverage R and Spark in Azure HDInsight for scalable machine learning\nMachine Learning with Scala on Spark  by Jose Quesada\n<\/pre>\n<p>Here are the results for just &#8220;writing&#8221;:<\/p>\n<pre>\nIs Nonfiction Literature?\n\"Oh, you liar, you storyteller\": On Fibbing, Fact and Fabulation \nThe Value of the Essay in the 21st Century\nMaking the Case for American Fiction: Connecting the Dots\nSiri Hustvedt in Conversation with Paul Auster\nIssues Related to the Teaching of Creative Writing\nAspen New York Book Series: The Art of the Memoir\nH.G. Adler - A Survivor's Dual Reverie\nSixth Annual Leon Levy Biography Lecture: David Levering Lewis\nContemporary Writing from Korea\n<\/pre>\n<p>These results look really good &#8211; we&#8217;ve removed all the &#8220;code&#8221; oriented results. <\/p>\n<p>In my first implementation I used the cosine similarity as the boost, rather than the angle it represented. This is almost as good, but it returned an article titled: &#8220;Re-writing, Re-reading, Re-thinking \u2013 Web Design in Words&#8221;. Clearly this has the correct terms, but is not actually about the topic we&#8217;re looking for.<\/p>\n<h2>Aboutness<\/h2>\n<p>To improve these results further, I&#8217;d like to measure whether the document is &#8220;about&#8221; the query terms. 
An easy way to do this is to average the term vectors of the query and of the document separately, then compute the cosine similarity between the two averages<sup><a href=\"#footnote_9_5515\" id=\"identifier_9_5515\" class=\"footnote-link footnote-identifier-link\" title=\"https:\/\/arxiv.org\/pdf\/1602.01137v1.pdf\">10<\/a><\/sup>. Per the linked paper from Microsoft Research, this is a good technique if it is used to re-shuffle top results (i.e., you would not want to replace full text search with this).<\/p>\n<pre lang=\"scala\">\nval queryMean = model.getWordVectorsMean(List(\"writing\"))\nval mean = model.getWordVectorsMean(NLP.getWords(document._1))\nval distance = Transforms.cosineSim(mean, queryMean)\n<\/pre>\n<p>Graphically, we&#8217;re comparing the &#8220;average&#8221; for the document to the &#8220;average&#8221; for the query:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-2-300x177.png\" alt=\"\" width=\"300\" height=\"177\" class=\"aligncenter size-medium wp-image-5544\" srcset=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-2-300x177.png 300w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-2-1024x604.png 1024w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-2-768x453.png 768w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-2-1536x906.png 1536w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-2-1200x707.png 1200w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-2.png 1681w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p>This is not fast &#8211; computing all of these term vectors and averaging them took 5 minutes 45 seconds on my machine with 16 parallel threads. 
This could be significantly improved by pre-computing the averages for each document and running on a GPU.<\/p>\n<p>Here are the new results for &#8220;writing&#8221; (the previous result, re-sorted by &#8220;aboutness&#8221;):<\/p>\n<pre>\nIssues Related to the Teaching of Creative Writing: 0.43\nAutobiography: 0.41\nContemporary Indian Writers: The Search for Creativity: 0.41\nMarjorie Welish: Lecture: 0.40\nHistory and Literature: The State of Play: A Roundtable Discussion: 0.40\nCritical Reading of Great Writers: Albert Camus: 0.40\nDaniel Schwarz: In Defense of Reading: 0.39\nThe Journey To The West by Professor Anthony C. Yu: 0.39\nBlogs, Twitter, the Kindle: The Future of Reading: 0.39\n<\/pre>\n<p>Again, these look good.<\/p>\n<h2>Overlapping Search Terms<\/h2>\n<p>Some people who set up email alerts enter all of their interests (Art, Hiking), and some enter terms that modify each other (Python, Programming). We need to identify whether the terms are related so that we can choose between &#8220;AND&#8221; and &#8220;OR&#8221; in the queries we generate.<\/p>\n<p>A simple approach to this is to compute the similarity between each pair of query terms and segment them into clusters. <\/p>\n<pre lang=\"scala\">\nterms.map(\n  (term1) =>\n    terms.map(\n      (term2) => (term1, term2)\n    )\n).flatten.filter(\n  (tuple) => tuple._1 < tuple._2\n).map(\n  (tuple) => \n    (tuple._1, tuple._2, w2v.model.get.similarity(tuple._1, tuple._2))\n)\n<\/pre>\n<p>Here are the scores for our example (these are cosine similarities, so higher means more closely related):<\/p>\n<pre>\ndistance(programming, python): 0.61\ndistance(art, hiking): 0.1\n<\/pre>\n<h2>Topic Diversity<\/h2>\n<p>While we&#8217;re now getting good results, we often see articles on the same subject. 
For an email list dedicated to learning, it&#8217;s no use giving people the same article written multiple times.<\/p>\n<p>In one email, a user who chose &#8220;writing&#8221; as a topic got these two talks back:<\/p>\n<pre>\nA Conversation with David Gerrold, Writer of Star Trek: The Trouble with Tribbles (58 minutes)\u00a0\nStar Trek: Science Fiction to Science Fact - STEM in 30 (28 minutes)\u00a0\n<\/pre>\n<p>Even worse, an alert for Python returned these results, which are all re-written versions of the same article:<\/p>\n<pre>\nPythons Positive Press Pumps\u00a0Pandas\u00a0\u00a0\nWhy is Python Growing So Quickly? \nPython explosion blamed on\u00a0pandas\u00a0\n<\/pre>\n<p>Improving the diversity of search results is a fascinating problem. If this were an e-commerce site, producing varied results for a broad search like &#8220;shoes&#8221; would give users hints about what the site has to offer, as well as prompts to refine the search. <\/p>\n<p>Solr can do k-means clustering of documents around phrases it discovers<sup><a href=\"#footnote_10_5515\" id=\"identifier_10_5515\" class=\"footnote-link footnote-identifier-link\" title=\"https:\/\/lucene.apache.org\/solr\/guide\/6_6\/result-clustering.html\">11<\/a><\/sup> using Carrot2<sup><a href=\"#footnote_11_5515\" id=\"identifier_11_5515\" class=\"footnote-link footnote-identifier-link\" title=\"http:\/\/project.carrot2.org\/\">12<\/a><\/sup> &#8211; we could improve diversity by choosing a talk from each cluster.<\/p>\n<p>The clusters I get for &#8220;transcript:python&#8221; are as follows:<\/p>\n<pre>\nDeal with Unicode\nFalse Really Are Equal to True\nInteractive Debugger\nOutputs a Path\nPiece of Fortran\nPython 2.7 Point 10\nEmbedding Situation\nAndroid App\nAwesome Capability\nBinary Multiply\nDepended on Pandas have Pandas\n<\/pre>\n<p>Each of these clusters has a list of talks within it. 
Here is what we get if we pick a talk per cluster:<\/p>\n<pre>\nGoogle I\/O 2011: Python@Google\nPorting Django apps to Python 3\nPython, C, C++, and Fortran Relationship Status: It\u2019s Not That Complicated\nSupercharging C++ Code with Embedded Python\nRules for Radicals: Changing the Culture of Python at Facebook\nPython for Ruby Programmers\nSaturday Morning Keynote (Brett Cannon)\nAll Things Open 2013 | Jessica McKellar | Python Foundation\n<\/pre>\n<p>This approach is really fast, but difficult to combine with the prior techniques (pick N results, reshuffle).<\/p>\n<p>A Word2vec alternative could do this: pick the top result, then find the least related result. Average these two, then find the result least related to that average, and repeat.<\/p>\n<p>We&#8217;re doing the following comparison iteratively:<br \/>\n<img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-3-300x156.png\" alt=\"\" width=\"300\" height=\"156\" class=\"aligncenter size-medium wp-image-5546\" srcset=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-3-300x156.png 300w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-3-1024x533.png 1024w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-3-768x400.png 768w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-3-1536x800.png 1536w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-3-1200x625.png 1200w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/10\/3-3.png 1903w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p>Again, this was a bit slow &#8211; 1 minute 30 seconds with 16 parallel threads.<\/p>\n<p>To test this, I changed the query to &#8220;python and pandas&#8221; to make it harder &#8211; that guarantees the original three articles show up.<\/p>\n<pre>\nPython explosion blamed on pandas: 1.0\nConsidering Python's Target Audience: 0.97\nAnimated routes with QGIS and Python: 
0.97\nI can't get some SQL to commit reading data from a database: 0.97\nUsing Python to build an AI Twitter bot people trust: 0.96\nGetting a Job as a Self-Taught Python Developer: 0.96\nDownload and Process DEMs in Python: 0.96\nHow to mine newsfeed data and extract interactive insights in Python: 0.94\nDifferential Equation Solver In MATLAB, R, Julia, Python, C, Mathematica, \nMaple, and Fortran: 0.86\nMy personal data science toolbox written in Python: 0.75\n<\/pre>\n<h2>Conclusion<\/h2>\n<p>In general, each technique builds upon the last by obtaining the top results and re-shuffling. Adding more computing resources improves relevance further, but it takes time to build and retrain the Word2vec model. In the next iteration of this, I intend to explore generating sequences of talks that build on each other, as well as dealing with geography (&#8220;history of philadelphia&#8221;).<\/p>\n<ol class=\"footnotes\"><li id=\"footnote_0_5515\" class=\"footnote\">https:\/\/wordnet.princeton.edu\/<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_0_5515\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_1_5515\" class=\"footnote\">https:\/\/en.wikipedia.org\/wiki\/List_of_Christian_denominations<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_1_5515\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_2_5515\" class=\"footnote\">https:\/\/en.wikipedia.org\/wiki\/Word_embedding<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_2_5515\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_3_5515\" class=\"footnote\">http:\/\/scholarworks.uno.edu\/cgi\/viewcontent.cgi?article=3003&amp;context=td<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_3_5515\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_4_5515\" 
class=\"footnote\">https:\/\/en.wikipedia.org\/wiki\/Rocchio_algorithm<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_4_5515\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_5_5515\" class=\"footnote\">https:\/\/www.youtube.com\/watch?v=-uiQY2Zatjo&amp;index=31&amp;list=PLU6n9Voqu_1FMt0C-tVNFK0PBqWhTb2Nv<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_5_5515\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_6_5515\" class=\"footnote\">https:\/\/github.com\/DiceTechJobs\/RelevancyFeedback<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_6_5515\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_7_5515\" class=\"footnote\">https:\/\/www.researchgate.net\/publication\/224319730_Improving_language_models_by_using_distant_information<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_7_5515\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_8_5515\" class=\"footnote\">http:\/\/opensourceconnections.com\/blog\/2015\/10\/16\/bm25-the-next-generation-of-lucene-relevation\/<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_8_5515\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_9_5515\" class=\"footnote\">https:\/\/arxiv.org\/pdf\/1602.01137v1.pdf<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_9_5515\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_10_5515\" class=\"footnote\">https:\/\/lucene.apache.org\/solr\/guide\/6_6\/result-clustering.html<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_10_5515\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_11_5515\" class=\"footnote\">http:\/\/project.carrot2.org\/<span class=\"footnote-back-link-wrapper\"> [<a 
href=\"#identifier_11_5515\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><\/ol>","protected":false},"excerpt":{"rendered":"<p>Exploring the issues around concept search<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[10],"tags":[121,185,348,385,517,599],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/5515"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=5515"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/5515\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=5515"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=5515"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=5515"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}