Fixing Oracle installer error PRVF-7501 : Sufficient space is not available at location “/tmp”

If your “/tmp” location is short on space, you may see the following error when installing Oracle 12c:

INFO: INFO: ERROR: [ResultSet.addErrorDescription:1078] 
PRVF-7501 : Sufficient space is not available at location "/tmp" on node "ora" 
[Required space = 1GB ]

To fix this (and avoid a number of similar errors), create a temp directory on a drive that you can write to and that has enough free space (“/root/tmp” in my case). Then wrap the Oracle installer in a shell script that overrides all of the temp variables:

export TEMP=/root/tmp
export TMP=/root/tmp
export TMPDIR=/root/tmp

“TEMP” affects the error shown above, which occurs a long way into the installer; “TMPDIR” affects the space check at the beginning.
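
Putting that together, a minimal wrapper script looks something like this (the directory and the installer path are examples – adjust both to wherever you unpacked the installer):

#!/bin/sh
# Point every temp variable Oracle checks at a directory with enough free space.
mkdir -p /root/tmp

export TEMP=/root/tmp
export TMP=/root/tmp
export TMPDIR=/root/tmp

# Example path - call the installer from wherever you extracted it.
/root/database/runInstaller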

Installing Solr 5.x on CentOS 7

To install Solr 5.x on CentOS 7, you need a JDK, preferably 64-bit:

yum search openjdk
 
yum -y install java-1.8.0-openjdk.x86_64

Then, download from a mirror and extract Solr:

wget http://mirrors.gigenet.com/apache/lucene/solr/5.2.1/solr-5.2.1.tgz
 
tar xvf solr-5.2.1.tgz

If everything worked, it should be simple to start:

cd solr-5.2.1
 
bin/solr start -h 45.55.156.155 -p 8000

To test that Solr is working, you can use wget. This won’t work until it starts up, which takes 10-30 seconds:

wget localhost:8000

Output:

--2015-08-02 18:57:19--  http://localhost:8000/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:8000... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://localhost:8000/solr/ [following]
--2015-08-02 18:57:19--  http://localhost:8000/solr/
Reusing existing connection to [localhost]:8000.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’
 
[ <=>                    ] 6,079     -.-K/s   in 0s
 
2015-08-02 18:57:19 (125 MB/s) - ‘index.html’ saved [6079]

If this fails after waiting, run Solr again with “-f” (foreground) to see the error:

bin/solr start -h 45.55.156.155 -p 8000 -f

If you run out of memory, add the “-m” argument (e.g. -m 256m) – by default Solr uses a 512 MB heap.

bin/solr start -h 45.55.156.155 -p 8000 -f -m 256m

To access Solr’s administration pages, connect over SSH with port forwarding enabled. With PuTTY this means adding “-L 8000:localhost:8000” (for our example) to the command line arguments. Once you connect this way, you can use “http://localhost:8000/solr/#/” to reach the Solr admin panel.
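
With a plain OpenSSH client, the equivalent command is below (the username is a placeholder – use whatever account you normally log in with):

# Forward local port 8000 to port 8000 on the Solr server,
# then browse to http://localhost:8000/solr/#/ on your own machine.
ssh -L 8000:localhost:8000 root@45.55.156.155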

Fixing “Cannot allocate memory” with Solr 5.2.1

If you run Solr on a small VM, it may appear to start and then fail.

bin/solr start -p 8000 -f

If you run it with “-f” (which prevents backgrounding the process) you will then see an out-of-memory error:

Starting Solr on port 8983 from /root/solr-5.2.1/server

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000e8000000, 402653184, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 402653184 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /root/solr-5.2.1/server/hs_err_pid16824.log

This is easily fixed by adding a memory argument (-m) – the default is 512m, so if you’re hitting this error you’re likely using a cheap VM and need to lower this, or rebuild the VM with more RAM:

bin/solr start -p 8000 -f -m 256m

This is what you should see:

Starting Solr on port 8000 from /root/solr-5.2.1/server

0    [main] INFO  org.eclipse.jetty.util.log  [   ] – Logging initialized @1403ms
816  [main] INFO  org.eclipse.jetty.server.Server  [   ] – jetty-9.2.10.v20150310
914  [main] WARN  org.eclipse.jetty.server.handler.RequestLogHandler  [   ] – !RequestLog
948  [main] INFO  org.eclipse.jetty.deploy.providers.ScanningAppProvider  [   ] – Deployment monitor [file:/root/solr-5.2.1/server/contexts/] at interval 0
3991 [main] INFO  org.eclipse.jetty.webapp.StandardDescriptorProcessor  [   ] – NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet
4030 [main] WARN  org.eclipse.jetty.security.SecurityHandler  [   ] – ServletContext@o.e.j.w.WebAppContext@457e2f02{/solr,file:/root/solr-5.2.1/server/solr-webapp/webapp/,STARTING}{/solr.war} has uncovered http methods for path: /
4136 [main] INFO  org.apache.solr.servlet.SolrDispatchFilter  [   ] – SolrDispatchFilter.init()WebAppClassLoader=1582797472@5e5792a0
4164 [main] INFO  org.apache.solr.core.SolrResourceLoader  [   ] – JNDI not configured for solr (NoInitialContextEx)
4165 [main] INFO  org.apache.solr.core.SolrResourceLoader  [   ] – using system property solr.solr.home: /root/solr-5.2.1/server/solr
4167 [main] INFO  org.apache.solr.core.SolrResourceLoader  [   ] – new SolrResourceLoader for directory: '/root/solr-5.2.1/server/solr/'
4392 [main] INFO  org.apache.solr.core.SolrXmlConfig  [   ] – Loading container configuration from /root/solr-5.2.1/server/solr/solr.xml
4534 [main] INFO  org.apache.solr.core.CoresLocator  [   ] – Config-defined core root directory: /root/solr-5.2.1/server/solr
4566 [main] INFO  org.apache.solr.core.CoreContainer  [   ] – New CoreContainer 2073621255
4567 [main] INFO  org.apache.solr.core.CoreContainer  [   ] – Loading cores into CoreContainer [instanceDir=/root/solr-5.2.1/server/solr/]
4567 [main] INFO  org.apache.solr.core.CoreContainer  [   ] – loading shared library: /root/solr-5.2.1/server/solr/lib
4568 [main] WARN  org.apache.solr.core.SolrResourceLoader  [   ] – Can't find (or read) directory to add to classloader: lib (resolved as: /root/solr-5.2.1/server/solr/lib).
4603 [main] INFO  org.apache.solr.handler.component.HttpShardHandlerFactory  [   ] – created with socketTimeout : 600000,connTimeout : 60000,maxConnectionsPerHost : 20,maxConnections : 10000,corePoolSize : 0,maximumPoolSize : 2147483647,maxThreadIdleTime : 5,sizeOfQueue : -1,fairnessPolicy : false,useRetries : false,
5305 [main] INFO  org.apache.solr.update.UpdateShardHandler  [   ] – Creating UpdateShardHandler HTTP client with params: socketTimeout=600000&connTimeout=60000&retry=true
5311 [main] INFO  org.apache.solr.logging.LogWatcher  [   ] – SLF4J impl is org.slf4j.impl.Log4jLoggerFactory
5327 [main] INFO  org.apache.solr.logging.LogWatcher  [   ] – Registering Log Listener [Log4j (org.slf4j.impl.Log4jLoggerFactory)]
5330 [main] INFO  org.apache.solr.core.CoreContainer  [   ] – Node Name:
5331 [main] INFO  org.apache.solr.core.CoreContainer  [   ] – No authentication plugin used.
5547 [main] INFO  org.apache.solr.core.CoresLocator  [   ] – Looking for core definitions underneath /root/solr-5.2.1/server/solr
5599 [main] INFO  org.apache.solr.core.CoresLocator  [   ] – Found 0 core definitions
5609 [main] INFO  org.apache.solr.servlet.SolrDispatchFilter  [   ] – user.dir=/root/solr-5.2.1/server
5610 [main] INFO  org.apache.solr.servlet.SolrDispatchFilter  [   ] – SolrDispatchFilter.init() done
5651 [main] INFO  org.eclipse.jetty.server.handler.ContextHandler  [   ] – Started o.e.j.w.WebAppContext@457e2f02{/solr,file:/root/solr-5.2.1/server/solr-webapp/webapp/,AVAILABLE}{/solr.war}
5693 [main] INFO  org.eclipse.jetty.server.ServerConnector  [   ] – Started ServerConne

Validating Application Performance in a cloud environment, using C#, Hangfire, and RethinkDB

The rise of “platform” sites (e.g. Heroku) enables developers to build and deploy web applications cheaply, without understanding operational problems. Typically these products let you purchase a combination of three things: web servers, databases, and background job execution. This works well, as long as you stay within their platform, and as long as their pricing maps to what you are doing.

There are several “platform” vendors that are essentially AWS resellers, so you should be able to mix and match application services. For instance, if you want to upgrade a database, you can purchase one from a company like Compose.io or from Amazon directly via RDS.

To test databases sold by different vendors, I built modified implementations of the C# database drivers which run the queries against multiple databases in parallel. These buffer the results, and log them to RethinkDB in a background thread using Hangfire. When you want a database connection, you do this:

connection = 
  new ProxyDbConnection(
    new List<IDbConnection>() {
      new SqlConnection(ConfigurationManager.ConnectionStrings["db1"].ToString()),
      new SqlConnection(ConfigurationManager.ConnectionStrings["db2"].ToString())
  });

Underneath, these log what you do to them, which lets us compare performance.

The log messages record everything we know about a transaction – the query, a hash of its parameters, the database, and an ID that links all of the concurrent operations into one group:

public class TimingRecord
{
  public String RunId { get; set; }
  public String DbId { get; set; }
  public DateTime RunDate { get; set; }
  public double? Duration { get; set; }
  public String Query { get; set; }
  public String QueryParms { get; set; }
  public String IP { get; set; }
  public String ErrorMessage { get; set; }
}

When a query completes or errors out, these messages are saved to a thread-safe collection. When the collection reaches a certain size, it is serialized and saved to RethinkDB in the background. If RethinkDB shuts down or is not configured correctly, the application behaves no worse than it did before this logging was introduced.

For background threads, I’m using Hangfire, which is nice in that you can turn any public method with serializable arguments into a background job.

if (_buffer.Count >= 50)
{
   ConcurrentQueue<TimingRecord> dataToSend;
   lock (_buffer)
   {
     dataToSend = _buffer;
    _buffer = new ConcurrentQueue<TimingRecord>();
   }
 
   BackgroundJob.Enqueue(() => SaveTimingLog(dataToSend.ToArray()));            
}
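
The method being enqueued is deliberately plain – a public method whose only argument is the serializable array of TimingRecord objects. A sketch of its shape (assuming a static method here; the body is filled in just below):

// Hangfire serializes the TimingRecord[] argument, persists the job,
// and invokes this method later on a background worker.
public static void SaveTimingLog(TimingRecord[] timings)
{
  // body shown below: open a RethinkDB connection and insert the batch
}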

Because RethinkDB lets you upload any object, saving the log messages is trivial:

var ConnectionFactory = ConfigurationAssembler.CreateConnectionFactory("logging");
 
using (var connection = ConnectionFactory.Get())
{
  var table = Query.Db("performance").Table<TimingRecord>("query_timings");
  connection.Run(table.Insert(timings));
  // no explicit Dispose needed - the using block disposes the connection
}

When you run a query using the proxy database connection, it runs against each of the provided connections:

var queryResults =
  Parallel.ForEach(
    _commands,
    (_command) => {
      ..
    }
);

Each execution returns an “Either” monad with either the results or an exception. This type is fairly complex, so it is aliased as a type called “QueryResults”.

using QueryResults = 
  System.Tuple<
    string, 
    Monad.Either<
      System.Data.SqlClient.SqlDataReader, 
      System.Exception>>;

Once the parallel loop finishes, we get all the results, treating the first as the “primary” database that the end user will see.

Once a bit of this data is logged, we can run some queries against the log in RethinkDB to see which database behaves better. RethinkDB uses JavaScript for its query engine. Unfortunately there is no concise way to demonstrate the queries for this project, since anything complex ends up as a series of map-reduce operations.

r.db('performance')
 .table('query_timings')
 .group('RunId')
 .reduce(
   function(left, right) {
     // pick out the row for each database, whichever side of the reduce it
     // arrives on ('db1' is the original database, 'db2' the new one);
     // r.branch is used because plain JS ternaries can't be sent to the server
     var original = r.branch(left('DbId').eq('db1'), left, right);
     var newDb = r.branch(left('DbId').eq('db2'), left, right);

     var d1 = original('Duration');
     var d2 = newDb('Duration');

     var delta = d2.sub(d1);
 
     return {
       'Duration1': d1,
       'Duration2': d2,
       'Delta': delta,
       'Duration': left('Duration')
     }
   }).ungroup()
     .map(function(group) {
      return {
        'id': group('group'),
        'Delta': group('reduction')('Delta')
      }
     }).map(function(row) {
       return {
         'id': row('id'),
         'Delta': row('Delta'),
         'Count': 1
       }
     })
    .group('Count')
    .reduce(function(left, right) {
      var Delta = left('Delta').add(right('Delta'));
      var Count = left('Count').add(right('Count'));    
 
      return {
        Delta: Delta,
        Count: Count,
        Avg: Delta.div(Count)
      }
    })

Example output:

[
  {
    "group": 1 ,
    "reduction": {
      "Avg": 26548.960784313724 ,
      "Count": 51 ,
      "Delta": 1353997
    }
  }
]

If this were a virtualized or bare metal environment, we’d be looking to answer several questions:

  • Are these servers in the same data center?
  • If so, will the network traffic stay within the datacenter?
  • Is the underlying hardware being shared with other applications?
  • How often will the database stall due to memory or lock contention?
  • How do I know if I’ve overprovisioned?
  • How do I know if I’ve migrated the data correctly?

I’m most concerned with database performance, as this is the hardest to get right.

In this environment, testing performance and error rates is typically sufficient, since the only things within your control are your choice of vendors and the database migration itself; this technique lets you prove out infrastructure changes.

While virtualized infrastructure doesn’t let you control the entire environment, you typically still have visibility into how it works. If you build everything with virtual machines, you rely on your hosting provider to get network routing in their data center correct, but you can still use tracert to verify that traffic stays where it should.

A friend in finance tells me that when rolling out product updates, they run the new version alongside the old for a month to prevent the introduction of defects. Rather than inventing replacements for tools like tracert, it’s far easier to run the same application against two different architectures and see which works best.

If you’re interested in the full source to the C# solution, it is available on GitHub.

Computing Absolute Values in RethinkDB

It took me a while to figure out how to compute absolute values in RethinkDB, because you can’t use the conditionals and mathematical operators you might normally expect.

I expect that abs will be added to the language relatively soon, based on the outstanding tickets, but if you need it now, here is how to implement it:

  'aDelta': 
     r.branch(
        row('Delta').gt(0), 
        row('Delta'), 
        row('Delta').mul(-1)
  ),

The following query is more complete if you want to see an example in action – it computes a root-mean-square variance as well (unfortunately there does not appear to be a power function in the language yet, so you can’t take the square root at the end that you’d want).

r.db('performance')
 .table('query_timings')
 .map(function(row) {
       return {
         'Delta': row('Delta'),
         'aDelta': 
            r.branch(
              row('Delta').gt(0), 
              row('Delta'), 
              row('Delta').mul(-1)
           ),
         'rmsDelta': row('Delta').mul(row('Delta')),
         'Count': 1
       }
     })
    .group('Count')
    .reduce(function(left, right) {
      var Delta = left('Delta').add(right('Delta'));
      var aDelta = left('aDelta').add(right('aDelta'));
      var rmsDelta = left('rmsDelta').add(right('rmsDelta'));
      var Count = left('Count').add(right('Count'));    
 
      return {
        Delta: Delta,
        Count: Count,
        aDelta: aDelta,
        rmsDelta: rmsDelta,
        // no power/root function yet, so the square root of rmsDelta is left to the caller
        Avg: Delta.div(Count)
      }
    })

Using multiple aggregates (SUM, COUNT, AVG) in a RethinkDB query

With a RethinkDB query, you can’t do “count” and “sum” quite the same as you’d do in SQL.

Since it uses a map-reduce style engine underneath, you set count to “1” on each of your rows and then add these up later:

r.db('performance')
 .table('query_timings')
 .pluck('DbId', 'Duration')
 .group('DbId')
 .map(function(row) {
  return {
    DbId: row('DbId'),
    Duration: row('Duration'),
    count: 1
  }
  })
  .reduce(function(left, right) {
    var dur = left('Duration').add(right('Duration'));
    var cnt = left('count').add(right('count'));    
 
    var avg = dur.div(cnt);  // use ReQL div() - the plain JS "/" operator won't work on query terms
 
    return {
      Duration: dur,
      count: cnt,
      avg: avg
    }
  })

Note, however, that with this implementation you’re recomputing the average at every step of the reduce, which is not ideal.
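
One way around that (a sketch reusing the same field names) is to keep only the running sums inside the reduce and divide once per group afterwards, using the same ungroup/map pattern that appears in the “count distinct” query below:

r.db('performance')
 .table('query_timings')
 .group('DbId')
 .map(function(row) {
   return {
     Duration: row('Duration'),
     count: 1
   }
 })
 .reduce(function(left, right) {
   // only accumulate sums here - no per-step average
   return {
     Duration: left('Duration').add(right('Duration')),
     count: left('count').add(right('count'))
   }
 })
 .ungroup()
 .map(function(group) {
   // divide once per group, after the reduction has finished
   return {
     DbId: group('group'),
     Duration: group('reduction')('Duration'),
     count: group('reduction')('count'),
     avg: group('reduction')('Duration').div(group('reduction')('count'))
   }
 })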

If you want to do “count distinct” you’d want to create a hashmap to track the distinct values:

r.db('performance')
 .table('query_timings')
 .group('DbId')
 .map(function(row) {
    var IP = row('IP');
 
    var o = r.object(IP, 1);
 
    return {
      DbId: row('DbId'),
      IP: IP,
      o
    }
  })
  .reduce(function(left, right) {
    var qL = left('o');
    var qR = right('o');
 
    var m = qL.merge(qR);
 
    return {
      DbId: left('DbId'),
      IPs: m.keys(),
      found: m.keys().count(),
      o: m
    }
  })
  .ungroup()
  .map(function(group) {
    return {
        'DbId': group('group'),
        'found': group('reduction')('found')
    }
  })

RethinkDB: Truncate table

Like most relational databases, RethinkDB has several different commands to delete records from a database.

To test the behavior, I made several copies of an existing table I had, which took about a minute per copy:

r.db('performance')
 .tableCreate('query_timings2')
 
r.db('performance')
 .table('query_timings2')
 .insert(r.db('performance')
 .table('query_timings'))

Currently, the SQL to ReQL comparison guide lists this as the equivalent of “truncate”:

r.db('performance')
 .table('query_timings')
 .delete()

In my database, this took about the same amount of time as creating the original table. I believe this runs in a transaction, as when I abort the query the number of rows remains the same.

In many relational databases, “truncate” has a specific meaning; in Oracle it deallocates the table’s underlying storage immediately and outside a transaction. In Postgres it runs within a transaction but skips any per-row work, which appears to match what an unqualified delete does in RethinkDB.

If you would like to replicate the Oracle behavior, dropping and re-creating the table seems to be much faster (I only know this experimentally, so I don’t know whether it runs outside a transaction):

r.db('performance')
 .tableDrop('query_timings') // 45 ms
 
r.db('performance')
 .tableCreate('query_timings')

RethinkDB has two other neat features that I’ve not seen in other databases: the option to do a “hard” vs. “soft” delete, and the option to return the deleted rows:

r.db('performance')
 .table('query_timings2')
 .delete({durability: "soft", returnChanges: false})

In my test, the “soft” delete took about half the time of the “hard” delete (basically it is not waiting for the OS to confirm changes were written to disk).
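
For comparison, the variant that hands the deleted rows back just flips the options (I haven’t timed this one – it will necessarily transfer more data, since every deleted document comes back in the response):

r.db('performance')
 .table('query_timings2')
 .delete({durability: "hard", returnChanges: true})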

Writing a “count distinct” query in RethinkDB

Writing queries to compute aggregates in RethinkDB is fairly simple; not so for “count distinct” style queries.

For this example we want to write a ReQL query that looks like this:

SELECT DbId, COUNT(DISTINCT IP) FROM logs GROUP BY DbId;

The first step is to group the data by your aggregation field and add a tracking object; we’ll use this map for counting later. Note that it is tempting to declare this as a nested object inside the return – for me this did not work.

In addition, RethinkDB internally creates its own JavaScript objects, which is what allows you to chain a series of functions on them. You can’t do “var o = {}”, or it will fail later.

r.db('performance')
 .table('query_timings')
 .group('DbId')
 .map(function(row) {
    var IP = row('IP');
 
    var o = r.object(IP, 1);
 
    return {
      DbId: row('DbId'),
      IP: IP,
      o
    }
  })

To do the counting, we take our previous counting maps and merge them (duplicate keys get overridden – that’s where the distinct comes in). Once complete, you can also return the list of distinct values if you like. Again, resist the temptation to think that because this is JavaScript you can use existing JavaScript code to merge objects (e.g. extend/apply or the “hasOwnProperty” method) – these will not work.

  .reduce(function(left, right) {
    var qL = left('o');
    var qR = right('o');
 
    var m = qL.merge(qR);
 
    return {
      DbId: left('DbId'),
      IPs: m.keys(),
      found: m.keys().count(),
      o: m
    }
  })

Once you have this done, you’ll have objects with two keys, “group” and “reduction”. You can fix this by ungrouping the results like so:

  .ungroup()
  .map(function(group) {
    return {
        'DbId': group('group'),
        'found': group('reduction')('found')
    }
  })

And after a lot of work, you get your data back:

[
  {
    "DbId":  "db027ceeffa00749a2b9e6a11f0003c13c" ,
    "found": 2
  }
]

RethinkDB WHERE clause examples

RethinkDB calls a WHERE clause a “filter”, and you can write one in several forms.

One is the “equals” form:

r.db('performance')
 .table('query_timings')
 .filter({ 'Duration': 0 })

If this form works for your query, you can put several conditions in the one function call, although they are all equality comparisons, which limits their usefulness:

r.db('performance')
 .table('query_timings')
 .filter({ 'Duration': 0, 'Day': 0 })

According to the documentation, you can use an index if you completely rewrite this form:

r.table("performance").index_create("Duration").run(conn)
 
r.table("users")
 .get_all("0", index="Duration")

The more useful form of a filter is to just write a function. However, beware that you have to use functions like “ne” for not-equals (and eq, add, etc.), as the driver isn’t able to send JavaScript operators across to the server.

r.db('performance')
 .table('query_timings')
 .filter(function(row) { return row('Duration').ne(null) })
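
If you need to combine conditions inside the function, the same rule applies – chain them with and/or rather than using && or || (a small example along these lines, using fields from the timing table above):

r.db('performance')
 .table('query_timings')
 .filter(function(row) {
   // "Duration greater than zero AND no error recorded"
   return row('Duration').gt(0).and(row('ErrorMessage').eq(null))
 })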

Alternatively, if you absolutely must use normal JavaScript syntax, or are writing an ORM, you can pass the JavaScript as a string, which is evaluated on the server side. Comparing this against the other forms would be an interesting exercise, to test the performance of different implementations of the same query.

r.db('performance')
 .table('query_timings')
  .filter(r.js('(function (row) { return row.Duration != null; })'))

They do offer one final shortcut, which is to use “match” (note this uses the somewhat unexpected “r.row” syntax, which refers to the document currently being visited, so you don’t have to declare a function):

r.db('performance')
 .table('query_timings')
 .filter(r.row('Query').match("^test.*"))

This is likely better for performance than building your own regex in a function, as (hopefully) the pattern is compiled once when the query is built, rather than on each call to the function.

Liquibase autoIncrement startWith example

Liquibase doesn’t seem to have a command line option to let you set the start points on autoIncrement fields when you generate patches, but you can fix this easily by adding “startWith” after the fact (useful for a database migration):

<changeSet author="gary" id="1437853066645-52">
  <createTable tableName="messages">
    <column autoIncrement="true" startWith="150000" name="Id" type="bigint">
      <constraints primaryKey="true" primaryKeyName="pk_messages"/>
    </column>
  </createTable>
</changeSet>