Querying Multiple Postgres Full-Text Indexes

Setup

I’ve set up several scenarios to see how Postgres handles querying multiple full-text indexes.

To demonstrate these scenarios I set up the following table and indexes, which has 1.1 million records:

CREATE TABLE data2
(
  author character varying,
  id character varying,
  email character varying,
  company character varying,
  date character varying,
  message character varying,
  github character varying,
  search character varying,
  name character varying
)

CREATE INDEX author_idx ON data2 USING gin(to_tsvector('english', author));
CREATE INDEX message_idx ON data2 USING gin(to_tsvector('english', message));
CREATE INDEX company_idx ON data2 USING gin(to_tsvector('english', company));

Querying across columns

You can easily query across columns, or combine multiple queries.

SELECT * 
FROM data2
WHERE 
  to_tsvector('english', message) @@ to_tsquery('android')
  and to_tsvector('english', message) @@ to_tsquery('merge')
  and to_tsvector('english', author) @@ to_tsquery('mike')
  and to_tsvector('english', company) @@ to_tsquery('google')

This structure triggers the use of a Bitmap indexes. I’ve been investigating whether it is possible to combine the index for several columns into one, but this presents an interesting case in favor of separate indexes.

When this is used to query multiple columns, it’s super-fast- bitmap operations generate lists of bits, which are combined at a low-level in large blocks with AND and OR, in the same format as the AND and OR in the query.

Bitmap Heap Scan on data2  
(cost=138.99..143.03 rows=1 width=881)"
  Recheck Cond: 
  ( (to_tsvector('english'::regconfig, (message)::text) 
   @@ to_tsquery('android'::text)) 
   AND (to_tsvector('english'::regconfig, (message)::text) 
   @@ to_tsquery('merge'::text)) 
   AND (to_tsvector('english'::regconfig, (company)::text) 
   @@ to_tsquery('google'::text)))
  Filter:
    (to_tsvector('english'::regconfig, (author)::text) 
    @@ to_tsquery('mike'::text))
  BitmapAnd  
  (cost=138.99..138.99 rows=1 width=0)
    Bitmap Index Scan on message_idx  
    (cost=0.00..36.45 rows=44 width=0)
      Index Cond: 
        ( (to_tsvector('english'::regconfig, (message)::text) 
        @@ to_tsquery('android'::text)) 
        AND (to_tsvector('english'::regconfig, (message)::text) 
        @@ to_tsquery('merge'::text)))
    Bitmap Index Scan on company_idx  
    (cost=0.00..102.29 rows=8838 width=0)
      Index Cond: 
        (to_tsvector('english'::regconfig, (company)::text)
        @@ to_tsquery('google'::text))

Similar to the last query, we can see that ORs also use bitmap operations:

SELECT * 
FROM data2
WHERE 
  to_tsvector('english', message) @@ to_tsquery('android')
  and to_tsvector('english', message) @@ to_tsquery('merge')
  and to_tsvector('english', author) @@ to_tsquery('mike')
  and to_tsvector('english', company) @@ to_tsquery('google')

Full Text and Joins

The next question is how this works when the columns queried are in separate tables – clearly this is harder to use with bitmap indexing, because for bitmap logical operators to work, the rows must be lined up between the two tables to execute the join. In Oracle, a scenario resembling this is actually possible with the star transform, but in Postgres our options are more limited.

To support this, we create two new tables, using the github URL as a join key:

create table author_github as 
select distinct author, github
from data2;

create table company_github as 
select distinct company, github
from data2;

create index author_github_github_idx on author_github(github);
create index company_github_github_idx on company_github(github);

create index author_github_author_idx ON author_github USING gin(to_tsvector('english', author));
create index company_github_company_idx ON company_github USING gin(to_tsvector('english', company));

Here, Postgres does all the work for each table, sorts the results, then does a merge join (a zipper algorithm). This isn’t quite as compelling as the bitmap combinations, but still effective, since the tables being queried are much smaller to start.

select author, company
from author_github ag
join company_github cg on ag.github = cg.github
where to_tsvector('english', author) @@ to_tsquery('linus')
  and to_tsvector('english', company) @@ to_tsquery('google')

Note that the indexes on the IDs are not used at all for this query.

Merge Join  
(cost=662.89..701.81 rows=2553 width=22)
  Merge Cond: 
  ( (cg.github)::text = (ag.github)::text)
    Sort  
    (cost=212.16..212.48 rows=126 width=36)
      Sort Key: cg.github
        Bitmap Heap Scan on company_github cg  
        (cost=12.98..207.77 rows=126 width=36)
          Recheck Cond: 
          (to_tsvector('english'::regconfig, (company)::text) 
          @@ to_tsquery('google'::text))
            Bitmap Index Scan on company_github_company_idx  
            (cost=0.00..12.95 rows=126 width=0)
             Index Cond: 
               (to_tsvector('english'::regconfig, (company)::text) 
               @@ to_tsquery('google'::text))
  Sort  
  (cost=450.72..451.38 rows=261 width=43)
    Sort Key: ag.github
      Bitmap Heap Scan on author_github ag  
      (cost=14.03..440.25 rows=261 width=43)
        Recheck Cond: 
        (to_tsvector('english'::regconfig, (author)::text) 
        @@ to_tsquery('linus'::text))
          Bitmap Index Scan on author_github_author_idx  
          (cost=0.00..13.96 rows=261 width=0)
            Index Cond: 
              (to_tsvector('english'::regconfig, (author)::text) 
              @@ to_tsquery('linus'::text))

Querying across columns through views

As one final test, lets prove that this still works when the query is against a view. Each column is renamed, just to show the functionality.

create view merge_commits as 
select author a, id i, company c, email e, date d
from data2
where to_tsvector('english', message) @@ to_tsquery('merge')

Within the view we have one full-text condition, then the query against the view adds a second:

select * from merge_commits
where to_tsvector('english', a) @@ to_tsquery('mike')

Like the original query, this generates Bitmap Ands, as expected.

Bitmap Heap Scan on data2  
(cost=216.86..392.01 rows=44 width=163)
  Recheck Cond: 
  ( (to_tsvector('english'::regconfig, (author)::text) 
  @@ to_tsquery('mike'::text)) 
  AND (to_tsvector('english'::regconfig, (message)::text) 
  @@ to_tsquery('merge'::text)))
    BitmapAnd
    (cost=216.86..216.86 rows=44 width=0)
      Bitmap Index Scan on author_idx
      (cost=0.00..106.29 rows=8838 width=0)
        Index Cond: 
        (to_tsvector('english'::regconfig, (author)::text) 
         @@ to_tsquery('mike'::text))
      Bitmap Index Scan on message_idx  
        (cost=0.00..110.29 rows=8838 width=0)
        Index Cond: 
          (to_tsvector('english'::regconfig, (message)::text) 
          @@ to_tsquery('merge'::text))

Once these tables are set up correctly, Postgres seems to be able to handle a variety of scenarios, the best being multiple filters on a single table.

Leave a Reply

Your email address will not be published. Required fields are marked *