GIN vs GiST For Faceted Search with Postgres Full Text Indexes

For this test, I set up a 1.1 million row table1 using data from github2.

If you’re not familiar with full-text in Postgres, they use the @@ operator to apply a query to a document3, which looks like this:

SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');

The order of the conditions doesn’t matter. Full-text indexes typically work by multiplying vectors, and Postgres provides two types of indexes to speed up this operation (although in my tests, they are not nearly as fast as a dedicated full-text system like Solr/Lucene).

This test shows an interesting case where Postgres will choose differently between GIN and GiST indexes. GiST indexes, in essence, are based on hash tables and GIN indexes are based on B-trees4. Supposedly GIN indexes take ~3x the time to build although for me they were close to the same, which suggests that my test was constrained by I/O read time on the source data. GIN indexes are also supposed to be ~3x faster to query, which I did experience.

As a benchmark, the slowest query does a full scan on the table and takes ~9-12 minutes (each index creation took a similar amount of time).

CREATE INDEX search_idx2 ON data2 USING gin(to_tsvector('english', SEARCH));
CREATE INDEX search_idx4 ON data2 USING gist(to_tsvector('english', SEARCH));
 
SELECT * 
FROM (
  SELECT author, COUNT(*) c
  FROM data2
  WHERE SEARCH @@ to_tsquery('linux')
  GROUP BY author
) counts 
ORDER BY c DESC;

Here is the execution plan for the above worst-case query:

Sort 
(cost=233587.16..233587.17 rows=6 width=22)
  Sort Key: (count(*))
    HashAggregate  
    (cost=233586.96..233587.02 rows=6 width=14)
      Seq Scan on data2 
     (cost=0.00..233578.12 rows=1768 width=14)
        Filter: 
         ( (search)::text 
         @@ to_tsquery('linux'::text) )

Applying the function used to generate the index causes Postgres to use the index, and cuts the time to ~23 seconds. For this query it choose the GIN index.

SELECT * 
FROM (
  SELECT author, COUNT(*) c
  FROM data2
  WHERE to_tsvector('english', SEARCH) @@ to_tsquery('linux')
  GROUP BY author
) counts 
ORDER BY c DESC;

Here is the plan (note search_idx2 is the GIN index). Interestingly we see an example of Postgres’s mysterious internal use of bitmap indexes, which can make things quite fast. Unlike Oracle you can’t create bitmap indexes directly – Oracle can use them aggressively if you buy the right license.

HashAggregate  
  (cost=74668.49..74669.28 rows=79 width=14) 
  (actual time=23726.071..23726.598 rows=1625 loops=1)
  Bitmap Heap Scan on data2  
    (cost=306.87..74535.03 rows=26692 width=14) 
    (actual time=15.051..23683.230 rows=27604 loops=1)

    Recheck Cond: (to_tsvector('english'::regconfig, 
                               (search)::text)
                  @@ to_tsquery('linux'::text))
    Rows Removed by Index Recheck: 42867
      Bitmap Index Scan on search_idx2  
      (cost=0.00..300.19 rows=26692 width=0) 
      (actual time=11.436..11.436 rows=27604 loops=1)
        Index Cond: (to_tsvector('english'::regconfig, 
                                (search)::text)
                    @@ to_tsquery('linux'::text))
Total runtime: 23727.012 ms

The next query runs several conjoined lookups with an AND, then chooses the GiST index.

~2 minutes on a 1.1 million row table

SELECT * 
FROM (
  SELECT author, COUNT(*) c
  FROM data2
  WHERE 
    to_tsvector('english', SEARCH) @@ to_tsquery('android')
    AND to_tsvector('english', SEARCH) @@ to_tsquery('ethernet')
    AND to_tsvector('english', SEARCH) @@ to_tsquery('linux')
  GROUP BY authors
) counts 
ORDER BY c DESC;

Note here that Postgres does combine all the conditions into one index lookup, instead of doing three and joining them.

Sort  
(cost=10.35..10.36 rows=1 width=22)
  Output: data2.author, (count(*))
  Sort Key: (count(*))
    HashAggregate  
    (cost=10.32..10.33 rows=1 width=14)
      Output: data2.author, count(*)
        Index Scan using search_idx4 on public.data2  
        (cost=0.01..10.32 rows=1 width=14)
          Output: 
            data2.author, data2.id, 
            data2.email, data2.company, 
            data2.date, data2.message, 
            data2.github, data2.search, 
            data2.name
          Index Cond: 
            ( (to_tsvector('english'::regconfig,
                           (data2.search)::text) 
            @@ to_tsquery('android'::text)) 
            AND (to_tsvector('english'::regconfig, 
                            (data2.search)::text) 
            @@ to_tsquery('ethernet'::text) ) 
            AND (to_tsvector('english'::regconfig, 
                            (data2.search)::text) 
            @@ to_tsquery('linux'::text) ) )

It’s not clear why Postgres chooses this index – if you disable the GiST index, the result switches back to the GIN index, and is nearly instant:

UPDATE pg_index SET indisvalid = FALSE 
WHERE indexrelid = 'search_idx4'::regclass;
 
SELECT * 
FROM (
  SELECT author, COUNT(*) c
  FROM data2
  WHERE 
    to_tsvector('english', SEARCH) @@ to_tsquery('android')
    AND to_tsvector('english', SEARCH) @@ to_tsquery('ethernet')
    AND to_tsvector('english', SEARCH) @@ to_tsquery('linux')
  GROUP BY author
) counts 
ORDER BY c DESC;

Here you can see the bitmap indexes again. This is the first query I’ve seen where Postgres comes near to Solr’s default performance (and I confirmed it’s not because Postgres cached results from prior queries, by restarting it).

Also note here that we can see the recheck condition the docs promise us occur sometimes with GIN indexes.

Sort 
(cost=68.09..68.09 rows=1 width=22)
  Sort Key: (count(*))
    HashAggregate  (cost=68.06..68.07 rows=1 width=14)
      Bitmap Heap Scan on data2  (cost=64.02..68.05 rows=1 width=14)
        Recheck Cond: 
          ( (to_tsvector('english'::regconfig, (search)::text) 
          @@ to_tsquery('android'::text) ) 
          AND (to_tsvector('english'::regconfig, (search)::text) 
          @@ to_tsquery('ethernet'::text) ) 
          AND (to_tsvector('english'::regconfig, (search)::text) 
          @@ to_tsquery('linux'::text) ) )
        Bitmap Index Scan on search_idx2  
          (cost=0.00..64.02 rows=1 width=0)
          Index Cond: 
            ( (to_tsvector('english'::regconfig, (search)::text) 
            @@ to_tsquery('android'::text)) 
            AND (to_tsvector('english'::regconfig, (search)::text) 
            @@ to_tsquery('ethernet'::text)) 
            AND (to_tsvector('english'::regconfig, (search)::text) 
            @@ to_tsquery('linux'::text)))

What this seems to show is that if you’re concerned about query performance, you want to use GIN indexes over GiST wherever index creation time supports it. That said, Solr may be a better option still – equivalent queries take < 1.5 seconds on a default Solr install, with no time spent tinkering with tuning.

Citations:
  1. http://www.garysieling.com/blog/importing-data-from-solr-to-postgres-with-scala []
  2. http://www.garysieling.com/blog/converting-git-commit-history-to-a-solr-full-text-index []
  3. http://www.postgresql.org/docs/9.2/static/textsearch-intro.html []
  4. http://www.postgresql.org/docs/9.1/static/textsearch-indexes.html []

Tags: , , ,

0 comments ↓

There are no comments yet...Kick things off by filling out the form below.

Leave a Comment

Current ye@r *