Gary Sieling

Tuning Solr (Lucene) Disk Usage

The initial index for FindLectures.com was huge – 9.1 GB for around 210,000 videos. Most Solr hosting services would charge $150-$300 / month for an index this size. For some time I used BizSpark on Azure, which gives you $150 / month of VM credit. When that ran out, I switched to a t2.medium VM on AWS, but that still costs $50 / month.

This index stores video titles, lengths, descriptions, closed captions, some facets (topic, year), and a quality score.

I made several changes, which reduced disk usage from 9.1 GB to 210 MB.

If you want to analyze your own index:

  • The Luke endpoint is great for finding out which features are enabled for each field (stored, indexed, etc.): http://18.204.10.30:8983/solr/talks/admin/luke?_=1541941486806&numTerms=0&wt=json
  • The admin page for each field can show you the top terms in the index, if applicable
  • Look at the sizes of files on disk – there are many types of file, and each represents a different type of information. Newer Lucene versions have more options, so it may be worth upgrading and reindexing.
  • The admin page for each field shows you how many documents have a value for it. If you find fields that are rarely used, they may be worth removing. If a field is used on every document, it’s worth tuning its storage.
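Two of the checks above are easy to script. The following is a minimal sketch, not the exact process I used: size_by_extension totals the bytes on disk for each file type in an index directory, and rarely_used_fields picks out fields that only a small share of documents populate. Both function names and the 5% threshold are made up for illustration, and the second function assumes the parsed Luke JSON carries index.numDocs and a per-field "docs" count, so check that against what your Solr version actually returns.

```python
from collections import defaultdict
from pathlib import Path


def size_by_extension(index_dir):
    """Return (extension, total_bytes) pairs for an index directory, largest first."""
    totals = defaultdict(int)
    for path in Path(index_dir).rglob("*"):
        if path.is_file():
            # Files with no extension (e.g. segments_N) are keyed by their full name.
            key = path.suffix or path.name
            totals[key] += path.stat().st_size
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def rarely_used_fields(luke, threshold=0.05):
    """Return (field, docs) pairs for fields set on under `threshold` of documents.

    `luke` is the already-parsed JSON from the Luke endpoint. Assumed shape:
    {"index": {"numDocs": N}, "fields": {name: {"docs": n, ...}}}.
    Fields without a "docs" count are skipped rather than guessed at.
    """
    num_docs = luke["index"]["numDocs"]
    rare = []
    for name, info in luke.get("fields", {}).items():
        docs = info.get("docs")
        if docs is not None and num_docs and docs / num_docs < threshold:
            rare.append((name, docs))
    return sorted(rare, key=lambda item: item[1])
```

Point size_by_extension at the core's data/index directory. In recent Lucene versions .fdt holds stored field data, .tim and .tip hold the term dictionary and its index, .pos holds positions, .dvd holds doc values, .nvd holds norms, and .tvd holds term vectors, so the largest extensions hint at which features to turn off. If segments are written as compound files (.cfs), this breakdown is less informative.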