Creating a sitemap in Express.js from a Solr database

Sitemaps give search engines extra information about the structure of a website by providing a list of pages to crawl. They also let you give the search engine some rough guidance about how often to re-crawl each page. The format is a small XML schema, and the structure is well-documented on sitemaps.org.

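You can see an example of what these look like below, which I blatantly copied from the above link:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>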

If you build a site where most of the content is sourced from a database, the number of pages can be a bit of a nebulous concept, and you get a lot of say in what is included. You want to strike a balance: avoid pumping a bunch of garbage into search engines and wasting your server resources, while still making sure they have a thorough index of whatever material you are providing.

For the sake of argument, let's say we define a function that takes a URL and writes out one of the URL blocks above:

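const moment = require('moment');

// Write one <url> entry to the response, stamped with today's date.
// Note: the sitemap protocol expects <loc> to be an absolute URL.
function writeUrl(response, url) {
  response.write(`   <url>
      <loc>` + url.trim() + `</loc>
      <lastmod>` + moment().format('YYYY-MM-DD') + `</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
`)
}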

This uses the date the sitemap was generated, which would encourage a lot of recrawling, unless the sitemap itself is cached. If your data doesn’t change often, you should design your application so that it knows when it was built or deployed, so that you can use that instead. Or, if your data has a last update date, you can use that, although as we will see, that may not be straightforward.
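
As a rough sketch of the build-date idea, the date can be read once at startup and passed into the entry builder. The environment variable name `BUILD_DATE` and the helper `urlEntry` are assumptions made up for this illustration, not part of the application described here.

```javascript
// Hypothetical: BUILD_DATE is set by the deploy script (e.g. "2024-03-01").
// Falls back to the process start date when it is missing.
const BUILD_DATE =
  process.env.BUILD_DATE || new Date().toISOString().slice(0, 10);

// Build one <url> entry; lastmod defaults to the build date,
// but a per-record date can be passed when the data has one.
function urlEntry(loc, lastmod = BUILD_DATE) {
  return [
    '   <url>',
    `      <loc>${loc.trim()}</loc>`,
    `      <lastmod>${lastmod}</lastmod>`,
    '      <changefreq>monthly</changefreq>',
    '      <priority>0.8</priority>',
    '   </url>',
    ''
  ].join('\n');
}
```

With this shape, the handler would call something like `response.write(urlEntry(someUrl))`, and the date only changes when the site is redeployed.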

In a simple website, you’re just reflecting rows in the database back into individual pages. If you back a website with something like Solr, the “query” is more open-ended. A useful approach is to create facets on each column, which gives you lists of unique values you can use to generate URLs.

An example URL:

let url = 
  '/ssl_certificates/*:*?' + 
  'rows=0&start=0&' + 
  'facet.field=cipherSuite&' + 
  'facet.field=level';

In this case I have an application proxying through to Solr, so the requests look a little different, but you get the idea. The key point is not to retrieve rows, just lists of facet values.
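
The example URL above can also be assembled from a list of field names, which avoids hand-typing each `facet.field` parameter. This is only a sketch: `facetQuery` is a hypothetical helper, and the path prefix is the one from my proxy. If you talk to Solr directly rather than through a proxy, you will likely also need `facet=true` to switch faceting on.

```javascript
// Build a facet-only query string: rows=0 suppresses documents,
// and each facet.field asks for the unique values of one column.
function facetQuery(fields) {
  const params = new URLSearchParams({ rows: '0', start: '0' });
  for (const field of fields) {
    params.append('facet.field', field);
  }
  return params.toString();
}

const url = '/ssl_certificates/*:*?' + facetQuery(['cipherSuite', 'level']);
```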

Once you get this, you can easily write a function that takes the request/response from an Express.js page and writes out a sitemap for all the provided facets.

This is complicated a little by the HTTP API used to reach Solr, which delivers the response in chunks that have to be merged before parsing, and by an idiosyncrasy of Solr itself: facet values come back as a flat list that alternates value and count. That said, once you write something like this, you probably won’t ever have to change it again.

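const http = require('http');
const _ = require('lodash');

// `response` is the Express response object for the sitemap route.
http.get({
  host: 'localhost', 
  port: 3000, 
  path: url
}, 
  (res) => {
    res.setEncoding('utf8');
    let data = '';
    
    // The body arrives in chunks; collect them all before parsing.
    res.on('data', (chunk) => {
      data += chunk;
    });
    
    res.on('end', () => {
      let searchResults = JSON.parse(data);
       
      response.setHeader('Content-Type', 'application/xml');
      response.writeHead(200);
      
      response.write(`<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
`);
      _.each(
        searchResults.facet_counts.facet_fields,
        (values, key) => {
          // Solr returns each facet as a flat [value, count, value, count, ...]
          // array, so keep only the even-indexed entries.
          let facetValues =
            _.filter(
              values,
              (value, idx) => idx % 2 === 0
            );
              
          _.each(facetValues, (value) => {
            writeUrl(response, '/ssl_search/' + key + ':' + encodeURIComponent(value));
            writeUrl(response, '/ssl_search/' + encodeURIComponent(value));
          })
        });
      
      response.write(`</urlset>
`);
      
      response.end();
    });
  }
);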
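
The facet handling in that handler can also be pulled out into a pure function, which makes it testable without a running Solr. The sketch below assumes the flat value/count layout described above; `xmlEscape` and `facetUrls` are hypothetical helper names. It also entity-escapes each URL, which the sitemap protocol asks for and which the handler above does not do.

```javascript
// Escape the five XML special characters.
function xmlEscape(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// facetFields looks like { level: ['A', 12, 'B', 3], ... }.
function facetUrls(facetFields, prefix = '/ssl_search/') {
  const urls = [];
  for (const [key, values] of Object.entries(facetFields)) {
    // Even indexes hold the values, odd indexes hold the counts.
    const facetValues = values.filter((_, idx) => idx % 2 === 0);
    for (const value of facetValues) {
      const encoded = encodeURIComponent(String(value));
      urls.push(xmlEscape(prefix + key + ':' + encoded));
      urls.push(xmlEscape(prefix + encoded));
    }
  }
  return urls;
}
```

The handler would then loop over `facetUrls(searchResults.facet_counts.facet_fields)` and write one entry per URL.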
