Import Google Ngrams data into Postgres

Google Ngrams is a dataset of word frequencies over time, based on Google Books [1]. There are files for different n-gram lengths (1-, 2-, 3-, 4-, and 5-word combinations). Unzipped, the 1-gram files come to just under 30 GB, which is large, but manageable on a developer workstation.

The files are partitioned by the first character of the word, and rows are grouped by year of publication. Each row gives the number of times a word occurs in a given year (match_count) and the number of books it appears in that year (volume_count). Note these can be quite large numbers, so you need bigint.
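Concretely, each file is plain tab-separated text, one row per (ngram, year) pair, in the order ngram, year, match_count, volume_count. The values below are made up purely to show the shape:

hello	1995	12345	678
hello	1996	23456	789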

I think it makes sense to import each file into its own table, because then you can query them separately to test the performance of whatever you want to do.

create table ngrams_0 (
  ngram varchar,
  year int,
  match_count bigint,   -- occurrences of the ngram in this year
  volume_count bigint   -- books containing the ngram in this year
);

create index on ngrams_0 (
  ngram, 
  volume_count
);

create table ngrams_1 (
  like ngrams_0 
  including indexes
);
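Rather than writing out a create statement per file, you can generate the remaining tables in a loop. A minimal sketch, assuming one table per partition file, named after its letter or digit, with ngrams_0 already created as the template:

for p in {a..z} {1..9}
do
  psql -c "create table ngrams_$p (like ngrams_0 including indexes);"
done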

Postgres has a built-in bulk importer, COPY, which is quite fast.

psql -c "\COPY ngrams_o FROM 'z:\googlebooks-eng-all-1gram-20120701-o' DELIMITER '\t' ENCODING 'utf-8'"

COPY doesn’t do any error handling – it just stops at the first row it can’t handle. I explored pgloader and a multicorn-based foreign data wrapper (file_fdw is a COPY wrapper) as alternatives, but found them more difficult to configure on Windows than I wanted.

Instead, I found there was only one error condition preventing COPY from working: backslashes in the dataset, which the text format treats as escape characters. I didn’t really want those rows anyway, so I removed them ahead of time. The same technique also works well for pre-filtering the files to a single year, which makes prototyping a lot easier (see the sketch after the loop below).

for f in googlebooks*
do
  echo filtered.$f
  grep -v '\\' $f > filtered.$f   # drop any row containing a backslash
done
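For example, to also cut a file down to one year while dropping the backslash rows (the year 2000 here is just an illustration; the year is the second tab-separated column):

awk -F'\t' '$2 == 2000 && $0 !~ /\\/' $f > filtered.$f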

Once you do this, you may want a view across all the tables:

create view all_ngrams as
select * from ngrams_0 union all
select * from ngrams_1 union all
...
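If writing out every union branch gets tedious, the statement can also be generated from the catalog. A sketch using psql's \gexec, assuming the partition tables all share the ngrams_ prefix:

select 'create view all_ngrams as '
    || string_agg(format('select * from %I', tablename), ' union all ')
from pg_tables
where schemaname = 'public' and tablename like 'ngrams\_%';
\gexec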

Once you have this, you can do interesting queries, like TF-IDF [2]. One caveat: all_ngrams has a row per (ngram, year), so the join below assumes the data was pre-filtered to a single year; otherwise you would need to aggregate volume_count per ngram first.

with words as (
  select doc_id, word, cnt, sum(cnt) over (partition by doc_id) wc
  from (
    select 'error' as word, 10 as cnt, 1 as doc_id
    union all
    select 'an' as word, 5 as cnt, 1 as doc_id
    union all
    select 'occurred' as word, 7 as cnt, 1 as doc_id
  ) raw_counts
)
select *, sum(tf_idf) over () score
from (
  select doc_id, word, cnt, tf, volume_count, idf, tf*idf tf_idf
  from (
    select
      doc_id,
      words.word,
      words.cnt,
      words.cnt::float / words.wc tf,          -- term frequency within the document
      aw.volume_count,
      ln(206272::float / aw.volume_count) idf  -- 206272 = max(volume_count) over the corpus
    from words
    left join all_ngrams aw on aw.ngram = words.word
  ) scores
) total_score;
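The 206272 constant stands in for the corpus size; per the comment in the query, it would be derived with something like:

select max(volume_count) from all_ngrams;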
  1. http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
  2. http://www.tfidf.com/
