Importing OpenLibrary data into RethinkDB

You can download lists of books and authors from the Open Library¹. The data dumps have a simple format: each is a tab-separated file with five columns, the last of which holds JSON data.

type - type of record (/type/edition, /type/work etc.)
key - unique key of the record. (/books/OL1M etc.)
revision - revision number of the record
last_modified - last modified timestamp
JSON - the complete record in JSON format
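
As an illustration of this layout, here is a minimal sketch that splits one line into all five columns. The `DumpRecord` interface, the `parseDumpLine` name, and the sample line are made up for this example and do not come from the Open Library documentation.

```typescript
// Hypothetical record shape; the fields mirror the five columns listed above.
interface DumpRecord {
  type: string;
  key: string;
  revision: number;
  lastModified: string;
  json: unknown;
}

function parseDumpLine(line: string): DumpRecord {
  const columns = line.split("\t");
  if (columns.length < 5) {
    throw new Error(`expected 5 columns, got ${columns.length}`);
  }
  const [type, key, revision, lastModified] = columns;
  // Defensive assumption: rejoin anything after the fourth tab so the JSON
  // column is parsed whole even if it contains a literal tab as whitespace.
  const json = JSON.parse(columns.slice(4).join("\t"));
  return { type, key, revision: Number(revision), lastModified, json };
}

// Invented sample line, for illustration only.
const sample =
  '/type/work\t/works/OL1W\t3\t2010-01-01T00:00:00\t{"key": "/works/OL1W", "title": "Example"}';
const record = parseDumpLine(sample);
console.log(record.key, record.revision);
```

Keeping the metadata columns can be useful if you want to filter by record type or revision before inserting anything.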

To import a dump, split each line on tabs and parse the JSON data in the last column.

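import * as r from 'rethinkdb';
import * as fs from 'fs';

// byline wraps a readable stream and emits one 'data' event per line.
const byline = require('byline');

// Keep only the record type (column 0) and the JSON payload (column 4).
function parseLine(line: string) {
  const columns = line.split("\t");
  return {
    type: columns[0],
    data: columns[4]
  };
}

// Abort the import on the first failed insert.
function onError(err: Error | null) {
  if (err) throw err;
}

function run() {
  r.connect({ host: 'localhost', port: 28015 },
    function (err, conn) {
      if (err) throw err;

      const stream = byline(
        fs.createReadStream('works.txt', { encoding: 'utf8' })
      );

      // Soft durability acknowledges writes before they are flushed to disk;
      // returnChanges: false means the changed documents are not sent back.
      const options = {};
      options['durability'] = 'soft';
      options['returnChanges'] = false;

      // Insert the JSON record from each line as one document.
      function onRowComplete(line: string) {
        r.db('openlibrary').table('works').insert(
          JSON.parse(parseLine(line).data),
          options
        ).run(conn, onError);
      }

      stream.on('data', onRowComplete);
    });
}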

The data files are quite large, so you need to be careful about memory allocation: avoid keeping more parsed objects alive in the process than you need. The options passed to insert (soft durability, with returnChanges set to false) make the import run much faster.
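
One way to keep the number of live objects bounded is to group rows into batches and wait for each batch to be written before accepting more. The sketch below is not part of the import code above; `createBatcher` and its `flush` callback are hypothetical names, and the batch size of 2 in the demo is only there to keep the example small.

```typescript
// Hypothetical helper: collects items and hands them to `flush` in groups
// of `size`, so at most `size` items are buffered at any time.
function createBatcher<T>(size: number, flush: (batch: T[]) => Promise<void>) {
  let pending: T[] = [];

  // Write out whatever is buffered; call once more at end of input.
  async function drain(): Promise<void> {
    if (pending.length === 0) return;
    const batch = pending;
    pending = [];
    await flush(batch);
  }

  // Buffer one item; resolves after any batch this item completed is flushed.
  async function add(item: T): Promise<void> {
    pending.push(item);
    if (pending.length >= size) await drain();
  }

  return { add, drain };
}

// Demo with a fake flush that only records the batch sizes it receives.
async function demo(): Promise<number[]> {
  const sizes: number[] = [];
  const batcher = createBatcher<string>(2, async (batch) => {
    sizes.push(batch.length);
  });
  for (const line of ["a", "b", "c"]) {
    await batcher.add(line);
  }
  await batcher.drain();
  return sizes; // [2, 1]
}
```

In the import itself, `flush` could pass the whole batch to a single insert call, since insert also accepts an array of documents and run returns a promise when no callback is given. Calling `stream.pause()` before `add` and `stream.resume()` once it resolves would keep the file reader from running ahead of the database.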

When you run it, you'll also need to raise Node's old-space heap limit (the flag's value is in megabytes):

node --max_old_space_size=2000000 dist/run.js

¹ https://openlibrary.org/developers/dumps
