Validating Application Performance in a cloud environment, using C#, Hangfire, and RethinkDB

The rise of “platform” sites (e.g. Heroku) enables developers to build and deploy web applications cheaply, without having to understand the underlying operational problems. Typically these products let you purchase a combination of three things: web servers, databases, and background job execution. This works well as long as you stay within their platform, and as long as their pricing maps to what you are doing.

There are several “platform” vendors that are essentially AWS resellers, so you should be able to mix and match application services. For instance, if you want to upgrade your database, you can purchase one from a company like Compose.io or from Amazon directly via RDS.

To test databases sold by different vendors, I built modified implementations of the C# database drivers which run queries against multiple databases in parallel. These buffer the timing results and log them to RethinkDB in a background thread using Hangfire. When you want a database connection, you do this:

var connection =
  new ProxyDbConnection(
    new List<IDbConnection>() {
      new SqlConnection(ConfigurationManager.ConnectionStrings["db1"].ToString()),
      new SqlConnection(ConfigurationManager.ConnectionStrings["db2"].ToString())
    });
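
The proxy classes themselves aren't shown in this post; as a rough sketch (the shape and member names are my assumptions, not the actual implementation), ProxyDbConnection wraps a list of real connections and fans every operation out to each of them:

// Hypothetical sketch of the proxy's shape; the real class also hands out
// proxy commands and records a TimingRecord for every execution (described below).
public class ProxyDbConnection : IDisposable
{
  private readonly List<IDbConnection> _connections;

  public ProxyDbConnection(IEnumerable<IDbConnection> connections)
  {
    _connections = connections.ToList();
  }

  public void Open()
  {
    // Every operation is applied to each underlying connection
    foreach (var connection in _connections)
      connection.Open();
  }

  public void Dispose()
  {
    foreach (var connection in _connections)
      connection.Dispose();
  }
}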

Underneath, these connections log everything you do to them, which lets us compare performance.

The log messages record everything we know about a transaction, including the query, a hash of its parameters, the database it ran against, and an ID that links all of the concurrent operations into one group:

public class TimingRecord
{
  // Links the concurrent executions of one logical query into a single group
  public String RunId { get; set; }
  // Identifies which of the configured databases this timing came from
  public String DbId { get; set; }
  public DateTime RunDate { get; set; }
  // How long the query took to execute
  public double? Duration { get; set; }
  public String Query { get; set; }
  // Hash of the query's parameters
  public String QueryParms { get; set; }
  public String IP { get; set; }
  public String ErrorMessage { get; set; }
}

When a query completes or errors out, these messages are saved to a thread-safe collection. Once the collection reaches a threshold size, it is serialized and saved to RethinkDB in the background. If RethinkDB shuts down or is not configured correctly, the application behaves no worse than it did before this logging was introduced.

For the background work, I'm using Hangfire, which is nice in that it can turn any public method with serializable arguments into a background job.

if (_buffer.Count >= 50)
{
  // Swap the full buffer out for a fresh one, then hand the batch to Hangfire
  ConcurrentQueue<TimingRecord> dataToSend;
  lock (_buffer)
  {
    dataToSend = _buffer;
    _buffer = new ConcurrentQueue<TimingRecord>();
  }

  BackgroundJob.Enqueue(() => SaveTimingLog(dataToSend.ToArray()));
}
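
For Hangfire to pick this up, SaveTimingLog only needs to be a public method whose arguments it can serialize into job storage; roughly (the signature is assumed from the enqueue call above):

// Hangfire serializes the TimingRecord[] argument when the job is enqueued
// and invokes this method later on a worker thread.
public void SaveTimingLog(TimingRecord[] timings)
{
  // body shown below: writes the batch to RethinkDB
}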

Because RethinkDB lets you insert any serializable object, saving the log messages is trivial:

var connectionFactory = ConfigurationAssembler.CreateConnectionFactory("logging");

using (var connection = connectionFactory.Get())
{
  var table = Query.Db("performance").Table("query_timings");
  connection.Run(table.Insert(timings));
}

When you run a query using the proxy database connection, it runs against each of the provided connections:

var queryResults =
  Parallel.ForEach(
    _commands,
    (_command) => {
      // run the command against this connection and capture the
      // result or exception as a QueryResults value
    });

Each execution returns an “Either” monad with the results or an exception. This type is fairly complex, so it is aliased as a type called “QueryResults”.

using QueryResults = 
  System.Tuple<
    string, 
    Monad.Either<
      System.Data.SqlClient.SqlDataReader, 
      System.Exception>>;

Once the parallel loop finishes, we get all the results, treating the first as the “primary” database that the end user will see.
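
A rough sketch of that selection step (simplified: a plain result/error pair stands in for the QueryResults alias, and it assumes _commands holds SqlCommand instances; the collection and indexing are illustrative, not the actual code):

// Each parallel execution records either a data reader or the exception it threw.
var results = new ConcurrentDictionary<int, Tuple<SqlDataReader, Exception>>();

Parallel.ForEach(_commands, (_command, state, index) =>
{
  try
  {
    results[(int)index] = Tuple.Create(_command.ExecuteReader(), (Exception)null);
  }
  catch (Exception ex)
  {
    results[(int)index] = Tuple.Create((SqlDataReader)null, ex);
  }
});

// Index 0 is the "primary" database: only its result (or failure) is surfaced
// to the caller; the other results exist solely so their timings can be logged.
var primary = results[0];
if (primary.Item2 != null)
  throw primary.Item2;
return primary.Item1;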

Once a bit of this data is logged, we can run queries against the logs in RethinkDB to see which database behaves better. RethinkDB queries are written in ReQL, expressed here in its JavaScript form. Unfortunately there is no concise way to demonstrate the queries for this project, since anything complex ends up as a series of map-reduce operations.

r.db('performance')
 .table('query_timings')
 .group('RunId')
 .reduce(
   function(left, right) {
     var original = r.branch(left('DbId').eq('db1'), left, right);
     var newDb = r.branch(left('DbId').eq('db2'), left, right);
     
     var d1 = original('Duration');
     var d2 = newDb('Duration');
     
     var delta = 
         d2.sub(d1);
     
     return {
       'Duration1': d1,
       'Duration2': d2,
       'Delta': delta,
       'Duration': left('Duration')
     }
   }).ungroup()
     .map(function(group) {
      return {
        'id': group('group'),
        'Delta': group('reduction')('Delta')
      }
     }).map(function(row) {
       return {
         'id': row('id'),
         'Delta': row('Delta'),
         'Count': 1
       }
     })
    .group('Count')
    .reduce(function(left, right) {
      var Delta = left('Delta').add(right('Delta'));
      var Count = left('Count').add(right('Count'));    
  
      return {
        Delta: Delta,
        Count: Count,
        Avg: Delta.div(Count)
      }
    })

Example output:

[
  {
    "group": 1 ,
    "reduction": {
      "Avg": 26548.960784313724 ,
      "Count": 51 ,
      "Delta": 1353997
    }
  }
]

If this were a virtualized or bare metal environment, we’d be looking to answer several questions:

  • Are these servers in the same data center?
  • If so, will the network traffic stay within the datacenter?
  • Is the underlying hardware being shared with other applications?
  • How often will the database stall due to memory or lock contention?
  • How do I know if I’ve overprovisioned?
  • How do I know if I’ve migrated the data correctly?

I’m most concerned with database performance, as this is the hardest to get right.

In this environment, testing performance and error rates is typically sufficient, since the only things within your control are your choice of vendors and how you migrate the data; this technique lets you prove out those infrastructure changes.

While virtualized infrastructure doesn’t let you control the entire environment, you typically still have visibility into how it works. If you build everything with virtual machines, you rely on your hosting provider to get network routing in their data center correct, but you can still use tracert to verify that traffic stays where it should.

A friend in finance tells me that when they roll out product updates, they run the new version alongside the old for a month to prevent the introduction of defects. Rather than invent replacements for tools like tracert, it’s far easier to run the same application on two different architectures and see which works best.

If you’re interested in the full source to the C# solution, it is available on GitHub.