This is the first of several essays on an ongoing research project: building a search engine for X.509 (SSL) certificates. Earlier iterations of this project were a tool that searched code to find engineers by their area of expertise [1], and a tool that searched several cloud storage services at once (Dropbox, OneDrive, and Google Docs).
There are a ton of free analysis tools for looking at SSL certificates, but they are nearly worthless in a corporate environment because they can’t reach internal services. The value of a search engine is that it lets you find domains with certificates like one you have, and see how quickly features age out (SSL being replaced by TLS, for instance).
To find websites with HTTPS enabled, I started with a list of around 1 million domains and pulled the certificate for each from the “www” subdomain (note that each subdomain could have a different certificate). Certificates are well-structured text files with lots of attributes, which allowed me to experiment with the UI for facets.
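The essay does not say how the certificates were pulled, so the following is only a minimal sketch of one way to do it with the Python standard library. The function names `www_host` and `fetch_certificate_pem` are hypothetical, and disabling verification is an assumption on my part: a crawler usually wants to record expired or self-signed certificates rather than reject them.

```python
import socket
import ssl


def www_host(domain: str) -> str:
    """Return the "www" host name for a domain (hypothetical helper)."""
    domain = domain.strip().lower().rstrip(".")
    return domain if domain.startswith("www.") else f"www.{domain}"


def fetch_certificate_pem(domain: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the leaf certificate served by www.<domain> as PEM text."""
    host = www_host(domain)
    context = ssl.create_default_context()
    # Assumption: we want to collect invalid certificates too, so skip validation.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname sets SNI, so shared hosts return the matching certificate.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return ssl.DER_cert_to_PEM_cert(der)
```

Calling `fetch_certificate_pem("example.com")` would connect to `www.example.com` on port 443 and return the certificate as PEM text, ready to be parsed into attributes for indexing.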
There are a couple of generic user interfaces for search engines, although most are built for Elasticsearch rather than Solr. The most well-known is Kibana [2]. Kibana seems to be a clear derivative of the log monitoring tool Splunk [3]. There is also a tool called SearchKit [4], which looks nice and appears to be more of a UI control library.
The principal goals of this iteration of the project are to set up features that support discovery by search engines, and to replicate the facet behavior used by Amazon and Newegg. In these two applications, the search facets represent a taxonomy of the data. Each filter has checkboxes for different values (e.g. for hard drives, size ranges). When you filter by one value (say “0-500 GB”), it narrows the search results but still allows you to add more values (“500-1000 GB”, etc.). While conceptually simple, this ended up requiring a lot of fiddling with the UI to get it to work consistently.
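To make the checkbox behavior concrete, here is a small in-memory sketch. It is not how the project's search backend works; it only illustrates the rule that values within one facet are combined with OR, separate facets are combined with AND, and the counts for a facet ignore that facet's own selection so the unchecked boxes stay available. The data and the function names are made up for illustration.

```python
from collections import Counter


def matches(item, selected, skip=None):
    """True if the item passes every facet filter except `skip`.

    Values inside one facet are ORed; separate facets are ANDed.
    """
    return all(
        item.get(facet) in values
        for facet, values in selected.items()
        if facet != skip and values
    )


def search(items, selected):
    """Return the items that satisfy all selected facet values."""
    return [item for item in items if matches(item, selected)]


def facet_counts(items, selected, facet):
    """Count values for one facet, ignoring that facet's own selection."""
    return Counter(
        item[facet]
        for item in items
        if facet in item and matches(item, selected, skip=facet)
    )


# Hypothetical catalog, following the hard drive example in the text.
drives = [
    {"size": "0-500 GB", "interface": "SATA"},
    {"size": "500-1000 GB", "interface": "SATA"},
    {"size": "500-1000 GB", "interface": "NVMe"},
    {"size": "1000+ GB", "interface": "NVMe"},
]
```

With `selected = {"size": {"0-500 GB"}}`, `search` returns one drive, while `facet_counts(drives, selected, "size")` still reports all three size ranges, which is what lets the user tick “500-1000 GB” as an additional value.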
The principal search experiment in this iteration is to build a “detail” concept into the rendering, so that you get an indexable profile page for each certificate. I set up server-side rendering with Node + React and put Cloudflare in front (more effort than I expected, but not a huge list of problems). To get Google to index these pages, you need to give it a sitemap, which is basically an XML file listing URLs and how often you expect them to change, with up to 50,000 URLs per file.
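As a rough illustration of the sitemap format, the sketch below splits a list of page URLs into files of at most 50,000 entries each. The function name, the `monthly` change frequency, and the example URL pattern are assumptions for illustration, not details from the project.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000


def build_sitemaps(urls, changefreq="monthly", limit=MAX_URLS_PER_FILE):
    """Return a list of sitemap XML strings, each holding at most `limit` URLs."""
    documents = []
    for start in range(0, len(urls), limit):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[start:start + limit]:
            entry = ET.SubElement(urlset, "url")
            ET.SubElement(entry, "loc").text = url  # special characters are escaped on output
            ET.SubElement(entry, "changefreq").text = changefreq
        body = ET.tostring(urlset, encoding="unicode")
        documents.append('<?xml version="1.0" encoding="UTF-8"?>\n' + body)
    return documents


# Hypothetical profile page URLs, one per certificate.
pages = [f"https://example.com/certificate/{n}" for n in range(5)]
sitemaps = build_sitemaps(pages, limit=2)  # small limit just to show the split
```

With the default limit, a site with more than 50,000 profile pages ends up with several sitemap files, which are normally tied together by a sitemap index file that lists them.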
In the essays that follow in this series, I will discuss what I learned from this iteration of the research project and what I think the next areas to research are. Roughly speaking, this will be divided into UI considerations, data loading lessons, operations, and what I learned from poking around the actual dataset. I suspect there is an opportunity to build search tools that promote niche items based on some “interestingness” measures, so some future research will hopefully start uncovering novel concepts. Since this project is effectively “done”, you can also poke around with what I have.
Other essays in this series
- Part 2: Lessons from the UI
- Part 3: Acquiring Data
- Part 4: Devops lessons
- Part 5: A look at the data
- Part 6: A look forward