The purpose of building an SSL search engine was to get a Solr UI up and running. There are a few interesting future directions for research.
Most importantly, I want to investigate better discovery of “interesting” material, rather than material that is directly focused on what you searched for. There are a lot of excellent academic works on Amazon that have only one or two reviews because no one knows about them, but they could be discovered through citations. Similarly, a lot of small organizations have put good video lectures online, but many websites for navigating this content are either horrible or have a high noise-to-signal ratio. YouTube is well known for pushing content that is featured or already has high view counts. In the case of both Amazon and YouTube, the materials they push on you are determined by what other people like, which naturally hides up-and-coming writers and those who don’t fit the profile of what a good author is imagined to be.
Here are a few interesting use cases I’ve been thinking about:
1. A space enthusiast wants to find great photos of space exploration. Space agency websites typically have museum style pages designed to get you through a pre-defined story, or they give you a huge dump of nearly identical photos.
2. An artist wants to research museum collections for inspiration. Many of these organizations have a lot of pictures online, but the collections are hard to navigate, with a few exceptions[1]. You can use a regular search engine with a lot of pictures, but images are almost universally tagged with the wrong name, artist, medium, etc.
3. You are looking to purchase a logo or website, but lack the vocabulary to describe your stylistic preferences (and your preferences are only partially formed). A search engine that lets you navigate through artistic periods to narrow your preferences would help. This problem is handled well by print books used as art school textbooks, e.g. The History of the Illustrated Book.
4. A person interested in learning looks for lectures online. If you go through YouTube you will find a lot of mixed quality: good content with bad recordings, bad content with bad recordings, terrible speakers with good content, etc. This forces you to manually curate your own material, and the time cost of a bad line of research is much higher than with something like image search.
5. A person looking for interesting books goes to Amazon and filters to highly rated books in various topics. There is a huge amount of good material, but if you apply your own knowledge it quickly becomes challenging to identify “interesting” content with a specific focus, e.g. African history in English by non-Western authors, or books on deafness by deaf authors. It isn’t that these books don’t exist, but the search engine doesn’t have data tagged in the right way, so you will never find them.
Aside from the notion of “interestingness”, there are several other things worth researching.
Several companies now offer “AI as a service,” and they are starting to return really good results on more sophisticated problems (see: Watson winning Jeopardy). The problem for researchers or small companies trying to use this kind of technology has always been getting enough data to train an algorithm to do anything remotely interesting. IBM has purchased several companies to build an AWS-style service, which includes APIs with usage-based pricing for identifying objects in images, named entity recognition, translation, assigning a taxonomy to text, speech-to-text, etc.
General purpose search engines are promoting features that allow tighter integration with their services, notably microformats[2], which let search engines offer datatype-specific features. Google has been slowly adding these, which lets it show the profiles of politicians drawn from their websites, Wikipedia, Twitter, etc.; headline images for top news or recipes; star ratings; and download links in search results.
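As a rough illustration of what this kind of markup looks like, here is a minimal sketch that generates a recipe description with a star rating. It uses schema.org vocabulary serialized as JSON-LD, which is one of the structured data formats Google accepts, rather than classic microformats. The function name `recipe_jsonld`, the recipe, and the example.com URL are all made up for the example.

```python
import json


def recipe_jsonld(name, image_url, rating_value, rating_count):
    """Build a schema.org Recipe description wrapped in a JSON-LD script tag.

    The function name and arguments are hypothetical; the property names
    (aggregateRating, ratingValue, ratingCount) come from schema.org.
    """
    data = {
        "@context": "https://schema.org",
        "@type": "Recipe",
        "name": name,
        "image": image_url,
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating_value,
            "ratingCount": rating_count,
        },
    }
    # Escape "</" so a string value cannot close the script tag early.
    # "\/" is a valid JSON escape for "/", so the payload still parses.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Made-up example values
snippet = recipe_jsonld("Example Soup", "https://example.com/soup.jpg", 4.5, 12)
print(snippet)
```

The resulting tag would go in the page's HTML. Whether a search engine actually shows a star rating or headline image from it is up to the search engine.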
Accelerated Mobile Pages are another interesting push from Google[3]; they allow fast rendering of interactive sites on mobile devices.
From a technology standpoint, it is also a short hop from “search” to “alerts”, as alerts are just search results that change over time. On the UI side, some established Solr UIs handle nested facets and show histograms of facet values for ranges, which lets you pack a lot of functionality into a physically small space.
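To make the “alerts are search results that change over time” idea concrete, here is a minimal sketch, assuming each result carries a stable `id`. The names `run_alert`, `new_matches`, and `fake_search` are hypothetical, and `fake_search` stands in for a real call to Solr.

```python
def new_matches(previous_ids, current_results):
    """Return the results whose IDs were not present in the previous run."""
    seen = set(previous_ids)
    return [doc for doc in current_results if doc["id"] not in seen]


def run_alert(saved_state, query, search):
    """Re-run a saved query and return only results that are new since last time.

    `search` is any callable that takes a query string and returns a list of
    dicts with an "id" key. `saved_state` maps each query to the IDs seen on
    its previous run and is updated in place.
    """
    results = search(query)
    fresh = new_matches(saved_state.get(query, []), results)
    saved_state[query] = [doc["id"] for doc in results]
    return fresh


# A tiny in-memory "index" and search function, made up for the example
index = [{"id": "a", "title": "Lecture on orbits"}]


def fake_search(query):
    return [doc for doc in index if query in doc["title"].lower()]


state = {}
first = run_alert(state, "lecture", fake_search)   # first run: everything is new
index.append({"id": "b", "title": "Lecture on comets"})
second = run_alert(state, "lecture", fake_search)  # second run: only "b" is new
print([doc["id"] for doc in first], [doc["id"] for doc in second])
```

This version only reports additions. A fuller alert system would probably also want to notice documents that dropped out of the results or whose content changed.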
The next iteration of this project will likely be a lecture search engine to discover interesting talks, but if you have any suggestions, please reply below!
Other essays in this series
- Part 1: Project Introduction
- Part 2: Lessons from the UI
- Part 3: Acquiring Data
- Part 4: Devops lessons
- Part 5: A look at the data