Tuning Solr (Lucene) Disk Usage

The initial index for FindLectures.com was huge – 9.1 GB for around 210,000 videos. Most Solr hosting services would charge $150-$300 / month for an index of this size. For some time I used BizSpark on Azure, which gives you a $150 / month VM. When that ran out, I switched to a t2.medium VM on AWS, but this costs $50 / month.

This index stores video titles, lengths, descriptions, closed captions, some facets (topic, year), and a quality score.

I made several changes, which reduced disk usage from 9.1 GB to 210 MB.

  • Removed a feature that let you search for phrases in videos by timestamp – this doubled the amount of closed caption information (saved 2 GB).
  • I initially marked all attributes as “stored”, for ease of debugging. Disabling this for the closed caption field saved 1.9 GB.
  • Facet fields were being both stored and indexed. They only need to be indexed, and full-text information about them is unlikely to be useful. Removing this information on facets saved 1.8 GB (termVectors="false" termPositions="false" omitNorms="true").
  • Solr also stores the positions of words in each document, which is useful for highlighting, or for reconstructing information not otherwise available at query time. Again, this is not useful for facets. Disabling position information saved 2 GB (termOffsets="false" omitPositions="true").
  • All of the fields are copied into a single, shared field (_text_) – this lets you search the title, description, and captions all at once. The downside is that it includes every field, and there doesn’t appear to be a way to itemize fields in a <copyField>. Because of this, the terms “true” and “false” were the most common words in the index. This could be addressed by concatenating only the fields you want at index time. For simplicity I chose to generate a list of stopwords (427 terms) instead – this saved 1.2 GB.
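
A minimal sketch of the concatenation alternative from the last bullet, i.e. building the combined search field yourself before sending documents to Solr. The field names (title, description, captions), the combined field name search_text, and the sample document are all hypothetical; the real schema's names are not given here, and this is not the approach I actually shipped.

```python
# Hypothetical field names: only these text fields feed the combined field.
SEARCHABLE_FIELDS = ("title", "description", "captions")


def add_combined_text(doc, fields=SEARCHABLE_FIELDS, target="search_text"):
    """Return a copy of doc with the chosen text fields joined into one field."""
    parts = []
    for name in fields:
        value = doc.get(name)
        if value is None:
            continue
        # Multi-valued fields arrive as lists; flatten them into the same text.
        values = value if isinstance(value, (list, tuple)) else [value]
        parts.extend(str(v) for v in values if str(v).strip())
    combined = dict(doc)  # leave the caller's document untouched
    combined[target] = " ".join(parts)
    return combined


# Made-up document for illustration only.
video = {
    "title": "Intro to Lucene",
    "description": "How inverted indexes work",
    "captions": ["welcome everyone", "today we cover indexing"],
    "has_captions": True,  # boolean flag that should stay out of full-text search
    "year": 2015,
}
indexed = add_combined_text(video)
print(indexed["search_text"])
```

Because the boolean and numeric fields are never copied, words like “true” and “false” do not end up in the combined field, so no stopword list is needed to filter them out.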

If you want to analyze your own index:

  • The Luke endpoint is great for finding out which features are enabled for each field (stored, indexed, etc.): http://18.204.10.30:8983/solr/talks/admin/luke?_=1541941486806&numTerms=0&wt=json
  • The admin page for each field can show you the top terms in the index, if applicable
  • Look at the sizes of files on disk – there are many types of file, and each represents a different type of information. Newer Lucene versions have more options, so it may be worth upgrading and reindexing.
  • The admin page for each field shows you how many documents have a value for it. If you find fields that are rarely used, they may be worth removing. If a field is used on every document, it’s worth tuning its storage.
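
The Luke check and the "how many documents use this field" check can be combined in a small script. This is a rough sketch: it assumes the Luke JSON response has an "index" section with numDocs, a "fields" map whose entries carry a "schema" flag string and a "docs" count, and an "info" section whose "key" map explains the flag letters. Check this against what your Solr version actually returns. The host in the usage comment and the sample data are made up.

```python
import json
from urllib.request import urlopen


def fetch_luke(base_url, core):
    """Fetch the Luke report for a core; numTerms=0 skips the top-terms lists."""
    url = f"{base_url}/solr/{core}/admin/luke?numTerms=0&wt=json"
    with urlopen(url) as response:
        return json.load(response)


def summarize_fields(luke):
    """List fields with their decoded flags and the share of documents using them."""
    num_docs = luke.get("index", {}).get("numDocs", 0)
    # Assumed shape: info.key maps flag letters to names, e.g. {"S": "Stored"}.
    flag_names = luke.get("info", {}).get("key", {})
    rows = []
    for name, info in luke.get("fields", {}).items():
        flags = [flag_names.get(ch, ch) for ch in info.get("schema", "") if ch != "-"]
        docs = info.get("docs", 0)
        coverage = docs / num_docs if num_docs else 0.0
        rows.append({"field": name, "flags": flags, "docs": docs, "coverage": coverage})
    # Rarely used fields first: these are the candidates for removal.
    rows.sort(key=lambda row: row["coverage"])
    return rows


# Made-up response in the assumed shape, so the sketch runs without a server.
sample = {
    "index": {"numDocs": 200},
    "fields": {
        "title": {"schema": "ITS-", "docs": 200},
        "speaker_bio": {"schema": "I-S-", "docs": 10},
    },
    "info": {"key": {"I": "Indexed", "T": "Tokenized", "S": "Stored"}},
}
report = summarize_fields(sample)
for row in report:
    print(row["field"], row["flags"], f"{row['coverage']:.0%}")

# Against a live server (hypothetical host):
# report = summarize_fields(fetch_luke("http://localhost:8983", "talks"))
```

Fields near the top of the output are candidates for removal. Fields near the bottom appear on almost every document, so any "Stored" or term vector flags on them are the ones most worth reviewing.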