{"id":4833,"date":"2016-08-05T12:28:44","date_gmt":"2016-08-05T12:28:44","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=4833"},"modified":"2016-08-05T12:28:44","modified_gmt":"2016-08-05T12:28:44","slug":"import-google-ngrams-data-postgres","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/import-google-ngrams-data-postgres\/","title":{"rendered":"Import Google Ngrams data into Postgres"},"content":{"rendered":"<p>Google Ngrams is a dataset of word frequencies over time, based on Google books<sup><a href=\"#footnote_0_4833\" id=\"identifier_0_4833\" class=\"footnote-link footnote-identifier-link\" title=\"http:\/\/storage.googleapis.com\/books\/ngrams\/books\/datasetsv2.html\">1<\/a><\/sup>. There are files for different numbers of words (1, 2, 3, 4, and 5 word combinations). Unzipped, the 1 word files are just under 30 GB, which is large, but manageable on a developer workstation.<\/p>\n<p>The files are partitioned by the first character of the word, and are grouped by the year of publication. From this, you get the number of books that use a word in a given year, and the number of times the word is used overall. Note these can be quite large numbers, so you need bigint.<\/p>\n<p>I think it makes sense to import each file into it&#8217;s own table, because then you can choose to query them separately to test the performance of what you want to do.<\/p>\n<pre lang=\"sql\">\ncreate table ngrams_0 (\n  ngram varchar, \n  year int, \n  match_count bigint, \n  volume_count bigint\n);\n\ncreate index on ngrams_0 (\n  ngram, \n  volume_count\n);\n\ncreate table ngrams_1 (\n  like ngrams_0 \n  including indexes\n);\n<\/pre>\n<p>Postgres has a built-in CSV importer, which seems to be quite fast. <\/p>\n<pre lang=\"bash\">\npsql -c \"\\COPY ngrams_o FROM 'z:\\googlebooks-eng-all-1gram-20120701-o' DELIMITER '\\t' ENCODING 'utf-8'\"\n<\/pre>\n<p>It doesn&#8217;t do any error handling &#8211; it just stops if it doesn&#8217;t know what to do. I explored using pgloader or a multicorn based foreign data wrapper (file_fdw is a COPY wrapper), as an alternative, but found these were more difficult to configure on Windows than I wanted.<\/p>\n<p>Rather, I found that there was only one error condition preventing COPY from work, which was slashes in the dataset. I didn&#8217;t really want these anyway, so I removed these rows ahead of time. This technique also works well to pre-filter the files to a single year, which makes prototyping a lot easier. 
<p>In the end, I found that there was only one error condition preventing COPY from working: backslashes in the dataset. I didn't really want those rows anyway, so I removed them ahead of time. The same technique works well for pre-filtering the files down to a single year, which makes prototyping a lot easier.</p>

<pre lang="bash">
for f in googlebooks*
do
  echo "filtered.$f"
  grep -v '\\' "$f" > "filtered.$f"
done
</pre>

<p>Once you do this, you may want a view across all of the tables:</p>

<pre lang="sql">
create view all_ngrams as
select * from ngrams_0 union all
select * from ngrams_1 union all
...
</pre>

<p>Once you have this, you can do interesting queries, like <a href="http://www.tfidf.com/">TF-IDF</a>:</p>

<pre lang="sql">
with words as (
  select doc_id, word, cnt, sum(cnt) over(partition by doc_id) wc
  from (
    select 'error' as word, 10 as cnt, 1 as doc_id
    union all 
    select 'an' as word, 5 as cnt, 1 as doc_id
    union all 
    select 'occurred' as word, 7 as cnt, 1 as doc_id
  ) wc
)
select *, sum(tf_idf) over () score
from (
  select doc_id, word, cnt, tf, volume_count, idf, tf*idf tf_idf 
  from (
    select 
      doc_id,
      words.word, 
      words.cnt,
      words.cnt::float / words.wc tf, 
      aw.volume_count, 
      ln(206272::float / aw.volume_count) idf -- 206272 = max(volume_count)
    from words 
    -- this join assumes the data was pre-filtered to a single year,
    -- so each word matches at most one row
    left join all_ngrams aw on aw.ngram = words.word
  ) scores
) total_score;
</pre>
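<p>The 206272 constant in the IDF term is, per the comment, the maximum volume_count in the corpus. If you want to recompute it against your own import, something like the query below should work; note that it scans every table behind the view, so it takes a while.</p>

<pre lang="sql">
-- Recompute the IDF denominator used in the TF-IDF query above.
select max(volume_count) from all_ngrams;
</pre>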
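<p>The view is also handy for simpler exploration. As a closing example, here is a sketch of pulling one word's usage over time, assuming you loaded the full files rather than a single-year subset; the word "database" is an arbitrary choice.</p>

<pre lang="sql">
-- One word's frequency over time; 'database' is an arbitrary example.
-- Postgres can push the ngram filter down into each branch of the
-- union, so the (ngram, volume_count) indexes still help.
select year, match_count, volume_count
from all_ngrams
where ngram = 'database'
order by year;
</pre>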