{"id":4750,"date":"2016-07-23T23:45:36","date_gmt":"2016-07-23T23:45:36","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=4750"},"modified":"2016-07-23T23:45:36","modified_gmt":"2016-07-23T23:45:36","slug":"importing-openlibrary-data-rethinkdb","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/importing-openlibrary-data-rethinkdb\/","title":{"rendered":"Importing OpenLibrary data into RethinkDB"},"content":{"rendered":"<p>You can download lists of books and authors from the Open Library<sup><a href=\"#footnote_0_4750\" id=\"identifier_0_4750\" class=\"footnote-link footnote-identifier-link\" title=\"https:\/\/openlibrary.org\/developers\/dumps\">1<\/a><\/sup>. The data dumps have a simple format: each is a tab-separated file whose last column holds JSON data.<\/p>\n<pre>\ntype - type of record (\/type\/edition, \/type\/work etc.)\nkey - unique key of the record. (\/books\/OL1M etc.)\nrevision - revision number of the record\nlast_modified - last modified timestamp\nJSON - the complete record in JSON format\n<\/pre>\n<p>To import a record, split each line on tabs and parse the JSON in the last column. 
<\/p>\n<pre lang=\"javascript\">\nimport * as r from 'rethinkdb';\n\n\/\/ Each dump line has five tab-separated columns; the JSON record is the last one.\nfunction parseLine(line: string) {\n  const columns = line.split(\"\\t\");\n  return {\n    type: columns[0],\n    data: columns[4]\n  };\n}\n\nfunction onError(err, result) {\n  if (err) throw err;\n}\n\nfunction run() {\n  const fs = require('fs'),\n        byline = require('byline');\n\n  r.connect({host: 'localhost', port: 28015},\n    function(err, conn) {\n      if (err) throw err;\n\n      \/\/ Read the dump one line at a time rather than loading the whole file.\n      const stream = byline(\n        fs.createReadStream('works.txt', { encoding: 'utf8' })\n      );\n\n      \/\/ Insert the parsed JSON record for each line as it arrives.\n      function onRowComplete(line) {\n        r.db('openlibrary').table('works').insert(\n          JSON.parse(parseLine(line).data),\n          { durability: 'soft', returnChanges: false }\n        ).run(conn, onError);\n      }\n\n      stream.on('data', onRowComplete);\n    });\n}\n\nrun();\n<\/pre>\n<p>The data files are quite large, so be careful about memory allocation: avoid creating more objects than necessary for each line. 
The options passed to insert make the import run much faster: with soft durability, RethinkDB acknowledges each write before it has been flushed to disk.<\/p>\n<p>When you run it, you&#8217;ll also need to raise Node&#8217;s heap limit (the value is in megabytes):<\/p>\n<pre lang=\"bash\">\nnode --max_old_space_size=2000000 dist\/run.js\n<\/pre>\n<ol class=\"footnotes\"><li id=\"footnote_0_4750\" class=\"footnote\">https:\/\/openlibrary.org\/developers\/dumps<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_0_4750\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><\/ol>","protected":false},"excerpt":{"rendered":"<p>Importing Open Library data into RethinkDB<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[12],"tags":[403,466],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/4750"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=4750"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/4750\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=4750"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=4750"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=4750"}],"curies":[{"name":"wp
","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}