Normally a personal research project doesn’t get installed in a “correct” fashion, but for this SSL Search Engine I decided to follow through, so I could see how Google treated it, and so that if I come up with a better dataset I’ll know how to deploy it. Normally I track the deployment process in a Google Doc, but this time I wrote a blog post for each section, which really helped my blog footprint for the project.
Deploying a dataset to Solr is super-easy – I just zip the core and unzip it in the right location on the server – and this works even if you’re going from Windows to Linux.
Installing Solr the first time is pretty easy if you don’t need a highly available setup (install the JDK + run Jetty). I’m a little concerned about how much memory it needs, so I spent some time researching VPS hosts that offer plans with high memory-to-disk ratios. There are very few non-fly-by-night hosting companies that let you add lots of RAM, so I decided to stick with Linode, but this is something to monitor in the future.
The server side load will be fairly low, because I’ve set up server side rendering, and have Cloudflare cache everything, so Solr won’t even be hit most of the time. The data will update infrequently, and when the Google crawler hits pages on the site it should force Cloudflare to pre-load everything. In the future it might be interesting to pre-compute a bunch of pages, as this would eliminate Solr’s overhead entirely for those requests.
The primary challenge with Solr on a VPS is that you don’t want to expose it directly, or people could run arbitrary queries through the UI (e.g. deleting everything). I fixed this with iptables (a firewall), blocking outside access to the Solr port. There appear to be ways to do this within Jetty as well, but it doesn’t seem worth the risk of testing how well Jetty is implemented. One neat trick is to forward a port when connecting through SSH (the ssh -L option), which lets you reach the Solr UI through the SSH connection to the VPS, but not otherwise.
Getting the Node side of the app running was a little more painful than Solr – I found that if the app crashed in a certain way, it could take down my Apache server as well (Apache proxies requests to Node). If I ever get to the point where I want to run more of these applications, I’ll have to decide between running multiple Node instances, or trying to keep different applications in sync. If I were building this machine from scratch I’d set up Dokku to allow multiple separate applications on the same host [1].
Node has a utility to keep it running called “Forever” – if the app crashes, Forever will restart it. This is necessary because any uncaught exception causes Node to shut down. Forever has an option to kill the script if it dies too many times in a short period, which seems ill-conceived given that a relentless spider (e.g. GoogleBot) could DoS you.
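One alternative to giving up after repeated crashes is to keep restarting but wait longer each time. The sketch below is only an illustration of that idea: the function name `restartDelayMs`, the window, and the delay values are all made up for this example, and it is not how Forever itself behaves.

```typescript
// Hypothetical restart policy; NOT Forever's actual behavior or API.
// Given the timestamps (in ms) of past crashes, decide how long to wait
// before the next restart. The delay grows with the number of crashes
// inside the window but is capped, so the app always comes back.
function restartDelayMs(
  crashTimes: number[],
  now: number,
  windowMs: number = 60000, // assumed: only the last minute counts
  baseMs: number = 1000,    // assumed: first back-off step
  maxMs: number = 30000,    // assumed: never wait longer than this
): number {
  // Count only crashes that happened within the last `windowMs`.
  const recent = crashTimes.filter((t) => now - t <= windowMs).length;
  // Zero or one recent crash: restart immediately.
  if (recent <= 1) return 0;
  // Double the delay for each additional recent crash, up to the cap.
  return Math.min(baseMs * Math.pow(2, recent - 2), maxMs);
}
```

With these assumed defaults, two crashes in the last minute give a one-second pause, and a burst of crashes levels off at thirty seconds, so a spider hammering a broken page slows the restarts down rather than leaving the site offline.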
The people writing Forever also seem to be slowly re-implementing Linux server tools. For instance, they have a log rotation feature now that they didn’t use to have. Fortunately most web server administration problems of this type are already solved, e.g. I used the logrotate utility to handle expiration of logs, rather than the Forever version of this.
Since Node runs with separate log files, you can easily track the pace at which GoogleBot crawls the application. Once I submitted a sitemap to Google, they crawled about 5,000 pages per day, and now I have a list of which ones cause failures (through the access logs). While Google crawled the site relatively quickly (under a week), it took about two full weeks for the pages to show up in the index. Interestingly, the indexed count would climb by a few hundred per day, and then it jumped to the full ~40,000. However, the full ~40,000 records would show up intermittently, and more frequently over time. This isn’t surprising given the scale of their infrastructure, but it’s interesting to see it in action.
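A rough sketch of how that crawl rate and failure list could be pulled out of an access log. It assumes the log uses the Apache-style "combined" format and that a 5xx status counts as a failure; the post does not say which format or threshold was actually used, and the names `LOG_LINE` and `summarizeGooglebot` are invented for this example.

```typescript
interface CrawlSummary {
  hitsPerDay: Record<string, number>; // e.g. { "10/Oct/2016": 4980 }
  failedPaths: string[];              // paths that returned a 5xx status
}

// Assumed "combined" log format:
// ip - - [day/Mon/year:time zone] "METHOD path proto" status size "referer" "agent"
const LOG_LINE =
  /^\S+ \S+ \S+ \[([^:\]]+):[^\]]*\] "\S+ (\S+)[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"/;

function summarizeGooglebot(lines: string[]): CrawlSummary {
  const hitsPerDay: Record<string, number> = {};
  const failedPaths: string[] = [];
  for (const line of lines) {
    const m = LOG_LINE.exec(line);
    if (m === null) continue; // skip lines that are not in the assumed format
    const day = m[1];
    const path = m[2];
    const status = Number(m[3]);
    const agent = m[4];
    // Keep only requests whose user agent claims to be Googlebot.
    if (agent.indexOf("Googlebot") === -1) continue;
    hitsPerDay[day] = (hitsPerDay[day] || 0) + 1;
    if (status >= 500) failedPaths.push(path);
  }
  return { hitsPerDay, failedPaths };
}
```

The user agent string can be spoofed, so this counts anything that claims to be Googlebot; that is good enough for watching the crawl pace, but not for access control.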
Cloudflare sits in front of the whole application. In theory this was supposed to make everything simple, but it caused the majority of the deployment issues. It caches just about everything that goes through, and it can also minify files. The HTML minification does not agree with React, because React depends on the HTML rendered on the server exactly matching what it renders on the client (there is a discussion around changing this [2]). I suspect that in the long run I’d be better off not trying to run a WordPress site and a Node app on the same server + domain, because I may eventually need different cache settings for each app.
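To see why minification breaks this, it helps to remember that any exact-match check fails as soon as a single byte of the markup changes. As far as I understand it, React versions of this era compared an Adler-32 checksum of the server markup against the client render; the sketch below only illustrates that principle. `naiveMinify` is a made-up stand-in for an HTML minifier, not Cloudflare's actual algorithm, and the markup is invented.

```typescript
// Adler-32 checksum over the UTF-16 code units of a string.
function adler32(text: string): number {
  const MOD = 65521;
  let a = 1;
  let b = 0;
  for (let i = 0; i < text.length; i++) {
    a = (a + text.charCodeAt(i)) % MOD;
    b = (b + a) % MOD;
  }
  // Combine the two 16-bit halves into one unsigned 32-bit value.
  return ((b << 16) | a) >>> 0;
}

// Hypothetical minifier: strips whitespace between adjacent tags.
function naiveMinify(html: string): string {
  return html.replace(/>\s+</g, "><").trim();
}

// Invented markup standing in for what the server rendered.
const serverMarkup =
  '<div id="app">\n  <h1>Results</h1>\n  <p>40,000 certificates</p>\n</div>';

// What the browser would receive after the proxy minifies the page.
const delivered = naiveMinify(serverMarkup);

// The visible content is identical, but the checksums no longer agree.
const checksumsMatch = adler32(serverMarkup) === adler32(delivered);
console.log(checksumsMatch); // false
```

The practical fix is to leave HTML minification turned off for the pages React renders, while still letting the proxy minify JavaScript and CSS.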
When I wrote this, I didn’t write an installation script – I just copied all the files out to the server, so Webpack is unfortunately now installed there. It’d be nice to be able to easily disable server side rendering, as this seems to be in conflict with “hot loading” in Webpack. It looks like the best way to handle this is to have separate Webpack files for development and production scenarios. This lets you set all the production / minification flags for React, and stop JSX/TypeScript compilation from running within Node.
Other essays in this series
- Part 1: Project Introduction
- Part 2: Lessons from the UI
- Part 3: Acquiring Data
- Part 5: A look at the data
- Part 6: A look forward