Computing Absolute Values in RethinkDB

It took me a bit to figure out how to do absolute values in RethinkDB, because you can’t use the conditionals and mathematical operators you might normally expect.

I expect that abs will be added to the language relatively soon, based on the outstanding tickets, but if you need it now, here is how to implement it:

  'aDelta': 
     r.branch(
        row('Delta').gt(0), 
        row('Delta'), 
        row('Delta').mul(-1)
     ),

The following query is more complete if you want to see an example in action – it also accumulates the sum of squares needed for a root-mean-square variance (unfortunately there does not appear to be a power function in the language as yet, so you can’t take the square root at the end that you’d want; a client-side workaround is sketched after the query).

r.db('performance')
 .table('query_timings')
 .map(function(row) {
       return {
         'Delta': row('Delta'),
         'aDelta': 
            r.branch(
              row('Delta').gt(0), 
              row('Delta'), 
               row('Delta').mul(-1)
           ),
         'rmsDelta': row('Delta').mul(row('Delta')),
         'Count': 1
       }
     })
     .group('Count')  // every row has Count = 1, so this forms a single group to reduce over
    .reduce(function(left, right) {
      var Delta = left('Delta').add(right('Delta'));
      var aDelta = left('aDelta').add(right('aDelta'));
      var rmsDelta = left('rmsDelta').add(right('rmsDelta'));
      var Count = left('Count').add(right('Count'));    
 
      return {
        Delta: Delta,
        Count: Count,
        aDelta: aDelta,
        rmsDelta: rmsDelta,
        // no power function in ReQL yet, so the square root happens client-side
        Avg: Delta.div(Count)
      }
    })
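
Since there is no way to take the square root server-side, one option is to finish the calculation client-side once the results come back. Here is a minimal sketch, assuming the grouped query above has been assigned to a variable named query (my name, not from the original) and conn is an open connection from the official Javascript driver:

// Run the grouped query and take the square root client-side,
// since ReQL has no power/sqrt function.
query.run(conn, function(err, results) {
  if (err) throw err;
  // Grouped queries come back as [{group: ..., reduction: {...}}, ...]
  results.forEach(function(g) {
    var red = g.reduction;
    console.log('RMS:', Math.sqrt(red.rmsDelta / red.Count));
  });
});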

Using multiple aggregates (SUM, COUNT, AVG) in a RethinkDB query

With a RethinkDB query, you can’t do “count” and “sum” quite the same as you’d do in SQL.

Since it uses a map-reduce style engine underneath, you set count to “1” on each of your rows and then add these up later:

r.db('performance')
 .table('query_timings')
 .pluck('DbId', 'Duration')
 .group('DbId')
 .map(function(row) {
   return {
     DbId: row('DbId'),
     Duration: row('Duration'),
     count: 1
   }
 })
 .reduce(function(left, right) {
   var dur = left('Duration').add(right('Duration'));
   var cnt = left('count').add(right('count'));

   // Javascript operators like ">" and "/" don't work on ReQL terms;
   // use r.branch and div instead
   var avg = r.branch(cnt.gt(0), dur.div(cnt), 0);

   return {
     Duration: dur,
     count: cnt,
     avg: avg
   }
 })

Note however, that with this implementation you’re computing the average over and over, which is not ideal.
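
One way around this – a sketch, borrowing the ungroup technique from the count-distinct example below – is to drop the average from the reduce entirely and derive it once per group at the end:

r.db('performance')
 .table('query_timings')
 .pluck('DbId', 'Duration')
 .group('DbId')
 .map(function(row) {
   return { Duration: row('Duration'), count: 1 }
 })
 .reduce(function(left, right) {
   return {
     Duration: left('Duration').add(right('Duration')),
     count: left('count').add(right('count'))
   }
 })
 .ungroup()
 .map(function(group) {
   // The average is now computed once per group, after the reduction
   return {
     DbId: group('group'),
     avg: group('reduction')('Duration').div(group('reduction')('count'))
   }
 })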

If you want to do “count distinct” you’d want to create a hashmap to track the distinct values:

r.db('performance')
 .table('query_timings')
 .group('DbId')
 .map(function(row) {
    var IP = row('IP');
 
    var o = r.object(IP, 1);
 
    return {
      DbId: row('DbId'),
      IP: IP,
      o: o
    }
  })
  .reduce(function(left, right) {
    var qL = left('o');
    var qR = right('o');
 
    var m = qL.merge(qR);
 
    return {
      DbId: left('DbId'),
      IPs: m.keys(),
      found: m.keys().count(),
      o: m
    }
  })
  .ungroup()
  .map(function(group) {
    return {
        'DbId': group('group'),
        'found': group('reduction')('found')
    }
  })

RethinkDB: Truncate table

Like most databases, RethinkDB has several different commands for deleting records.

To test the behavior, I made several copies of an existing table I had, which took about a minute per copy:

r.db('performance')
 .tableCreate('query_timings2')
 
r.db('performance')
 .table('query_timings2')
 .insert(r.db('performance')
 .table('query_timings'))

Currently, the SQL to ReQL comparison guide lists this as the equivalent of “truncate”:

r.db('performance')
 .table('query_timings')
 .delete()

In my database, this took about the same amount of time as creating the original table. I believe this runs in a transaction, as when I abort the query the number of rows remains the same.

In many relational databases, “truncate” has a specific meaning: in Oracle it deallocates the table’s underlying storage immediately and outside a transaction, while in Postgres it runs within a transaction but skips the row-by-row work of a normal delete. The latter appears to match what an unqualified delete does in RethinkDB.

If you would like to replicate the Oracle behavior, dropping and re-creating a table seems to be much faster (I only know this experimentally, so I don’t know if this runs outside a transaction)

r.db('performance')
 .tableDrop('query_timings') // 45 ms
 
r.db('performance')
 .tableCreate('query_timings')

RethinkDB has two other neat features that I’ve not seen in other databases: the option to do a “hard” vs. “soft” delete, and the option to return the deleted rows:

r.db('performance')
 .table('query_timings2')
 .delete({durability: "soft", returnChanges: false})

In my test, the “soft” delete took about half the time of the “hard” delete (basically it is not waiting for the OS to confirm changes were written to disk).
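
If you do want the deleted rows back, the same delete can be written with returnChanges enabled – a sketch; each entry in the returned changes array includes the old value of the document:

r.db('performance')
 .table('query_timings2')
 .delete({returnChanges: true})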

Writing a “count distinct” query in RethinkDB

Writing queries to compute aggregates in RethinkDB is fairly simple; not so for “count distinct” style queries.

For this example, we want to write the ReQL equivalent of this SQL query:

SELECT COUNT(DISTINCT IP), DbId FROM logs GROUP BY DbId;

The first step is to group the data by your aggregation field and add a tracking object – we’ll use this map for counting later. Note that it is tempting to declare this as a nested object inside the return; for me this did not work.

In addition, RethinkDB internally creates its own Javascript objects, which is what allows you to chain a series of functions on them. You can’t do “var o = {}”, or it will fail later.

r.db('performance')
 .table('query_timings')
 .group('DbId')
 .map(function(row) {
    var IP = row('IP');
 
    var o = r.object(IP, 1);
 
    return {
      DbId: row('DbId'),
      IP: IP,
      o: o
    }
  })

To do the counting, we find our previous counting map and merge it (duplicate keys get overridden – that’s where the distinct comes in). Once complete, you can also return the list of distinct values if you like. Again, resist the temptation to think that because this is Javascript you can use existing Javascript code to merge objects (e.g. extend/apply or the “hasOwnProperty” method) – these will not work.

  .reduce(function(left, right) {
    var qL = left('o');
    var qR = right('o');
 
    var m = qL.merge(qR);
 
    return {
      DbId: left('DbId'),
      IPs: m.keys(),
      found: m.keys().count(),
      o: m
    }
  })

Once you have this done, you’ll have objects with two keys, “group” and “reduction”. You can fix this by ungrouping the results like so:

  .ungroup()
  .map(function(group) {
    return {
        'DbId': group('group'),
        'found': group('reduction')('found')
    }
  })

And after a lot of work, you get your data back:

[
  {
    "DbId":  "db027ceeffa00749a2b9e6a11f0003c13c" ,
    "found": 2
  }
]

RethinkDB WHERE clause examples

RethinkDB calls a WHERE clause “filter”, and you can write it in several forms.

One is the “equals” form:

r.db('performance')
 .table('query_timings')
 .filter({ 'Duration': 0 })

If this form works for your case, you can put many filters in a single function call, although they are all equality comparisons, so they aren’t that useful:

r.db('performance')
 .table('query_timings')
 .filter({ 'Duration': 0, 'Day': 0 })

According to the documentation, you can use an index if you completely rewrite this form (translated here to the Javascript driver, against the tables above):

r.db('performance')
 .table('query_timings')
 .indexCreate('Duration')
 .run(conn)

r.db('performance')
 .table('query_timings')
 .getAll(0, {index: 'Duration'})

The more useful form of a filter is to just write a function. However, beware that you have to use functions like “ne” for not-equals (and “eq”, “add”, etc.), as the driver isn’t able to send Javascript operators across.

r.db('performance')
 .table('query_timings')
 .filter(function(row) { return row('Duration').ne(null) })

Alternatively, if you absolutely must use normal Javascript syntax, or you are writing an ORM, you can pass the Javascript as a string, which is eval’ed on the server side. Comparing this against the other forms would be an interesting way to test the performance of various implementations of the same query.

r.db('performance')
 .table('query_timings')
  .filter(r.js('(function (row) { return row.Duration != null; })'))

They do offer one final shortcut method, which is to use “match”, which presumably returns a lambda (note this uses the somewhat unexpected “r.row” syntax, so this is operating as some sort of static method):

r.db('performance')
 .table('query_timings')
 .filter(
   r.row('Query').match("^test.*")
 )

This is likely better for performance than building your own regex in a function, as (hopefully) the pattern is compiled once when the query is built, rather than on each call to the function.
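
For comparison, here is the same predicate written in the explicit-function form from earlier – equivalent in results, though possibly not in performance:

r.db('performance')
 .table('query_timings')
 .filter(function(row) {
   // match returns null when there is no match, which filter treats as false
   return row('Query').match("^test.*")
 })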

Liquibase autoIncrement startWith example

Liquibase doesn’t seem to have a command line option to let you set the start points on autoIncrement fields when you generate patches, but you can fix this easily by adding “startWith” after the fact (useful for a database migration):

<changeSet author="gary" id="1437853066645-52">
  <createTable tableName="messages">
    <column autoIncrement="true" startWith="150000" name="Id" type="bigint">
      <constraints primaryKey="true" primaryKeyName="pk_messages"/>
    </column>
  </createTable>
</changeSet>

Finding the number of rows of all tables in a SQL Server database

The following query finds all user tables in a SQL Server database, ordered by row count:

SELECT
    sysobjects.*
    , sysindexes.ROWS
FROM
    sysobjects
    INNER JOIN sysindexes
    ON sysobjects.id = sysindexes.id
WHERE
    TYPE = 'U'
    AND sysindexes.IndId < 2
ORDER BY sysindexes.ROWS DESC

Comparing two SQL Server databases with Liquibase

Liquibase is a tool for managing database schemas (e.g. diffing schemas and writing migration scripts), and supports most major commercial and open-source relational databases. I’ll show how to use it to compare a migrated database to the original database, to find and fix discrepancies.

To use it with SQL Server, you’ll first need to download Liquibase and the Sql Server JDBC Driver.

Once you’ve done this, you can make a file called “liquibase.properties” in the directory where you’re running it. This avoids long command line arguments, and it’s really convenient when you switch from diffing to applying updates, since you don’t need to make any changes to the file later in the process.

Here’s my example properties file – obviously you’ll need to change the paths to fit your environment:

classpath=D:\\Software\\SQL Server JDBC\\sqljdbc_4.0\\enu\\sqljdbc4.jar
driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
url=jdbc:sqlserver://new host:1433;databaseName=new database name
username=new server user
password=new server password
changeLogFile=ChangeLog.xml

referenceUrl=jdbc:sqlserver://old server host;databaseName=old database name
referenceUsername=old database username
referencePassword=old database password

Once you do this, you can run:

liquibase diff

This prints out a nice report that tells you what differences it found. One of the risks of using such a tool is that it may not detect some class of difference that is material to your application, but this appears to be fairly thorough.

The nice thing about running a report is that it gives you an early read on how long this is going to take in the future when you run the migration for real (minutes, hours, days).

Unexpected Index(s): NONE
Changed Index(s): NONE
Missing Primary Key(s):
     PK_ABExperiments_1 on ABExperiments(UniqueId, Experiment)
     PK_AccountCredits on AccountCredits(AccountId)
Unexpected Primary Key(s): NONE
Changed Primary Key(s): NONE
Missing Schema(s): NONE
Unexpected Schema(s): NONE
Changed Schema(s): NONE
Missing Sequence(s): NONE
Unexpected Sequence(s): NONE
Changed Sequence(s): NONE
Missing Stored Procedure(s): NONE
Unexpected Stored Procedure(s): NONE
Changed Stored Procedure(s): NONE
Missing Table(s): NONE
Unexpected Table(s): NONE
Changed Table(s): NONE
Missing Unique Constraint(s): NONE
Unexpected Unique Constraint(s): NONE
Changed Unique Constraint(s): NONE

The previous step was optional. To get the real diff, the following command will make an XML document containing every difference:

liquibase diffChangeLog

Once you have this, you can simply run the update. However, before running the update, I removed the reference database from the properties file as a “belts and suspenders” tactic, although I’m sure it’s safe anyway.

liquibase update

Then you will get a series of failures (and I never promised this would be easy). Fortunately, it rolls back each time. You have to decide how to fix these in your case – while some of them may be defects in Liquibase, you’re more likely to hit problems with the database itself.

For instance, the migration I ran to build the new database didn’t bring across auto-increment fields, so I get patches like this, which fail:

<addAutoIncrement columnDataType="int(10)" columnName="Id" tableName="Accounts"/>

The problem here is two-fold: Liquibase would need to know where to restart the auto-increment field, and even if that were solved, SQL Server doesn’t let you change an existing column to an auto-increment (identity) column. Dropping the column and re-adding it as an auto-increment is a poor option as well, because the table will have its columns out of order, potentially breaking code that relies on “SELECT *”. This alone indicates that I’ll need to re-migrate this database.

Fortunately each update is a single line, so you can (and should) delete the problem updates from the patch XML and re-run it. Most likely you’ll hit several problems, but Liquibase stops after hitting one category of problem, so it’s helpful to know how many failures you’re going to get.

I’m not sure why this happens, but Liquibase tries to set default values on my boolean columns, which fails:

Default value of 0 does not match defined type of boolean, D:/Software
/liquibase-3.4.0-bin/ChangeLog.xml::1437835822256-139::gary (generated)

Liquibase also tried to re-create all my primary keys. This may be due to the difference between an “identity” auto-increment column and a plain one, or just that the original migration changed the names of the primary keys. Fortunately these I can simply remove entirely.

Unexpected error running Liquibase: The object 'PK_Accounts' is dependent on col
umn 'Id'. [Failed SQL: ALTER TABLE [dbo].[Accounts] ALTER COLUMN [Id] [int]]

Liquibase also attempts to change column types on many columns, but this fails because the keys were already migrated:

<changeSet author="gary (generated)" id="1437835822256-137">
  <modifyDataType columnName="Id" newDataType="bigint(19)" tableName="lcsk_Messages"/>
</changeSet>

This may be an indication that it is better to use Liquibase to start the migration, and then a separate tool to migrate the data.

RethinkDB “IS NULL” in a where clause filter

I find the syntax for RethinkDB filtering a little unintuitive – if this were actually Javascript, you might expect to see “row['Query']”. I expect this allows them to instrument the queries more easily, although considering they supposedly translate queries into an intermediate language, this doesn’t seem strictly necessary.

To do equality, non-equality, and so on, there are several functions available (ne / eq / gt / lt). I suspect that the reason for this is that the getter function for a column always returns an object – so it can’t be “equal” to null. This is likely a Builder pattern that lets them lazy-execute results; if this database goes the direction of Oracle’s optimizer it will also provide opportunities for query rewriting.

The key point is that using Javascript operators will be syntactically correct, but will not work. Unfortunately the RethinkDB editor does not warn you about this or auto-correct them.

r.db('performance')
 .table('query_timings')
 .filter(function(d) {
   return d('Query').ne(null)
 })

Hopefully in the future we’ll also see ES6 arrow functions, to make this more compact.
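
For instance, assuming a driver and runtime that support arrow functions, the filter above shrinks to a one-liner:

r.db('performance')
 .table('query_timings')
 .filter(d => d('Query').ne(null))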

Executing a “GROUP BY” in RethinkDB

Because RethinkDB uses a map-reduce architecture underneath, the syntax for ‘group by’ is a little different from what you might expect coming from SQL.

r.db('performance')
 .table('query_timings')
 .pluck('Query', 'Duration')
 .group('Query')
 .sum('Duration')

The interesting effect of this is that even though you specify only ‘Duration’ in the sum, you get the grouping fields back as well. If you remove ‘sum’, you’ll get all the rows in each group in an array, which I imagine would be pretty handy for a lot of React-based applications.
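
For instance, dropping the sum from the query above returns each group’s rows as an array:

r.db('performance')
 .table('query_timings')
 .pluck('Query', 'Duration')
 .group('Query')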