Lessons about search engines from building a Search Engine for X.509 certificates

When I deployed my SSL search engine, one of my goals was to learn a little more about how search engines operate. Strictly speaking, I don’t think Google wants to index the output of other search engines, but I see this as more of an application with lists of items and a detail page for each item. That makes it different from a pure “search” engine: it has a page for each entity (more like Amazon), and those are the pages I’m trying to get Google to index. That said, SEO is more difficult than necessary if you use JavaScript, even though this has improved with time.

When Google crawls your site, its crawler identifies itself with a specific User-Agent string, which shows up at the end of each access log entry:

"HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

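To pick these requests out of a log, something like the following could work. It is only a sketch: it assumes the User-Agent is the last double-quoted field on each line, as in the fragment above, and the function names are made up for illustration. A match only means the client claims to be Googlebot, since the header is easy to spoof.

```python
import re
from typing import Iterable, Optional

# Assumption: the User-Agent is the final double-quoted field on the line,
# as in the log fragment shown above. Other log formats will need a different pattern.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def user_agent_of(log_line: str) -> Optional[str]:
    """Return the last quoted field of a log line, or None if there is none."""
    match = _LAST_QUOTED_FIELD.search(log_line)
    return match.group(1) if match else None


def is_googlebot(log_line: str) -> bool:
    """True when the User-Agent field claims to be Googlebot (not verified)."""
    agent = user_agent_of(log_line)
    return agent is not None and "googlebot" in agent.lower()


def count_googlebot_hits(lines: Iterable[str]) -> int:
    """Count the log lines whose User-Agent claims to be Googlebot."""
    return sum(1 for line in lines if is_googlebot(line))
```

Running `count_googlebot_hits` over one day's worth of log lines gives a rough crawl count for that day.
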
From a software architecture perspective, I didn’t build this to be easily indexable, so I was forced to provide Google with a sitemap. In the future I want to provide better URL addressability into the application, so that people can post links to specific pieces of state.

Sitemaps are limited both in file size and in number of URLs: at most 50,000 URLs and 10 MB per file. The 50,000 URL limit is much easier to hit. I submitted a sitemap with 40,000 URLs and found that Google crawled approximately 5,000 per day (based on what I saw in the logs). It took about a week for these to start showing up in search results. At first, over that week, a few more became visible each day, in the tens to hundreds. At times the count went down (presumably as changes propagated through their network).
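
For a site that outgrows one file, the usual approach is to split the URLs across several sitemaps and list those in a sitemap index. Below is a minimal sketch of that idea, not the code this site uses: the helper names are hypothetical, and the limits are simply the 50,000 URL and 10 MB figures quoted above. It needs Python 3.8 or later for the `xml_declaration` argument.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000          # URL limit quoted above
MAX_BYTES_PER_SITEMAP = 10 * 1024 * 1024  # size limit quoted above


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield successive lists of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def _build(root_tag, entry_tag, urls) -> bytes:
    """Serialize a list of URLs as <root_tag><entry_tag><loc>...</loc></entry_tag>...</root_tag>."""
    root = ET.Element(root_tag, xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, entry_tag), "loc").text = url
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def build_sitemap(page_urls) -> bytes:
    """Build one sitemap file; raise if it breaks either limit."""
    if len(page_urls) > MAX_URLS_PER_SITEMAP:
        raise ValueError("too many URLs for a single sitemap")
    xml = _build("urlset", "url", page_urls)
    if len(xml) > MAX_BYTES_PER_SITEMAP:
        raise ValueError("sitemap exceeds the size limit")
    return xml


def build_sitemap_index(sitemap_urls) -> bytes:
    """Build the index file that points at each individual sitemap."""
    return _build("sitemapindex", "sitemap", sitemap_urls)
```

With 40,000 URLs, `chunk` yields a single list, so one sitemap is enough; the index only becomes necessary past 50,000.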

Google’s reported crawl statistics approximately match my experience watching the logs:

Google Webmaster Tools shows the total number of pages indexed over time, which is quite interesting:

What you can see here is that it peaks at or above where I expected the number of pages to be, and then drops. This may be due to errors when the Node server crashes, but I can’t be certain.

The actual traffic from this is low, as expected; this is the lowest of low-interest, long-tail traffic. Despite being very competitive with other sites, it does get some traffic (~300 visitors per month).