Fixing NullPointerException in EdmNavigationPropertyImplProv.getMultiplicity

Sometimes you may receive a NullPointerException in the getMultiplicity function of EdmNavigationPropertyImplProv.java, part of the Olingo OData library.

This is caused by not having set up your associations / association sets correctly (i.e. foreign keys). To fix the problem you either need to fix your associations and association sets so that they match and include ‘setMultiplicity’ calls, or remove the associations entirely.
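For reference, here is a minimal sketch of an association with explicit multiplicity on both ends, loosely modeled on the Olingo 2.x provider model classes – the namespace, type names, and roles below are placeholders for your own model:

// Hypothetical names – substitute your own namespace, entity types, and roles.
// Missing setMultiplicity calls on the ends are what trigger the NullPointerException above.
Association carManufacturer = new Association()
  .setName("Car_Manufacturer")
  .setEnd1(new AssociationEnd()
    .setType(new FullQualifiedName("MyNamespace", "Car"))
    .setRole("r_Car")
    .setMultiplicity(EdmMultiplicity.MANY))
  .setEnd2(new AssociationEnd()
    .setType(new FullQualifiedName("MyNamespace", "Manufacturer"))
    .setRole("r_Manufacturer")
    .setMultiplicity(EdmMultiplicity.ONE));

The matching AssociationSet in your entity container should reference the same association name and roles, or the two won’t line up.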

Fixing OData/Olingo error: “Could not find an entity set or function import for ‘x’.”

Sometimes with Apache Olingo you see this error message:

Could not find an entity set or function import for 'my_table'.

The Olingo sample applications register types like so:

static final String ENTITY_SET_NAME_CARS = "Cars";

private static final FullQualifiedName ENTITY_TYPE_1_1 =
  new FullQualifiedName(NAMESPACE, ENTITY_NAME_CAR);

...

List<EntitySet> entitySets = new ArrayList<EntitySet>();

entitySets.add(getEntitySet(ENTITY_CONTAINER, ENTITY_SET_NAME_CARS));

So, all you need to do to fix this error is to fill in these spots for your new type. To set up the EntitySet you can do the following:

new EntitySet()
  .setName(name)
  .setEntityType(new FullQualifiedName(NAMESPACE, name))

Note that you may want to use different values for ‘setName’ and ‘setEntityType’ – in the sample apps they use the pluralized form for the entity set name and the singular form for the entity type.
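Putting it together, a getEntitySet implementation in your EdmProvider might look roughly like the following. This is only a sketch based on the sample app – ENTITY_CONTAINER, ENTITY_SET_NAME_CARS, and ENTITY_TYPE_1_1 are assumed to be defined as in the constants above:

@Override
public EntitySet getEntitySet(String entityContainer, String name) throws ODataException {
  if (ENTITY_CONTAINER.equals(entityContainer) && ENTITY_SET_NAME_CARS.equals(name)) {
    // "Cars" is the entity set name; ENTITY_TYPE_1_1 points at the singular "Car" type
    return new EntitySet().setName(name).setEntityType(ENTITY_TYPE_1_1);
  }
  return null;
}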

Fixing Olingo Error “Format of ‘1’ is not compatible with ‘Edm.String’.”

When changing types in Olingo OData services, you may get an error like the following:

Format of '1' is not compatible with 'Edm.String'.

This indicates you have a line like this:

return type.valueOfString(key.getLiteral(), 
  EdmLiteralKind.DEFAULT, 
  property.getFacets(), 
  Integer.class);

Which should in fact be like this:

return type.valueOfString(
  key.getLiteral(), 
  EdmLiteralKind.DEFAULT, 
  property.getFacets(), 
  String.class);

Tips for Debugging Olingo Services

If you hit an error in an Olingo service (Olingo being one of the Java libraries for OData), you often get nothing more descriptive than “An exception occurred.”

This, apparently, is the result of a decision by the Olingo team not to depend on a specific logging library (although I’m not sure how this makes sense, since they already depend on CXF, which seems pretty heavy).

The simple solution they recommend is to add a callback method to your ODataServiceFactory implementation:

public <T extends ODataCallback> T getCallback(final Class<? extends ODataCallback> callbackInterface)
{ 
  return (T) (callbackInterface.isAssignableFrom(ScenarioErrorCallback.class) ? 
    new ScenarioErrorCallback() : 
      callbackInterface.isAssignableFrom(ODataDebugCallback.class) ? 
        new ScenarioDebugCallback() : 
        super.getCallback(callbackInterface));
}

And then, classes for each:

private final class ScenarioDebugCallback implements ODataDebugCallback {
  @Override
  public boolean isDebugEnabled()
  { 
    return true; 
  }
}
 
private final class ScenarioErrorCallback implements ODataDebugCallback {
  @Override
  public boolean isDebugEnabled()
  { 
    return true; 
  }
}

Now, it seems silly that you’d want both of these, but together they add error messages to the JSON response, with each line of the Java stack trace as a separate JSON object.

With these callbacks in place, add “odata-debug=json” to any Olingo URL and you’ll get debug information back.
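For example, against the sample cars service generated by the Olingo archetype (the service path and the “Cars” entity set come from that sample), a debug request might look like:

http://localhost:8080/my-car-service/MyFormula.svc/Cars?odata-debug=json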

Once you do this, you will then discover that the JSON payloads of Olingo are monstrous, and to deal with this you need a browser plugin – I like JSON Formatter, which auto-detects JSON, then formats and colorizes it:

[Screenshot: JSON Formatter rendering an Olingo response]

Getting Started with OData in Java

OData is a new web service protocol, developed by Microsoft, which sits somewhere between ODBC, REST, and a business-intelligence-oriented SQL.

The protocol seems to be fairly complex, and as such does not seem well-suited to “big data” workflows. It does, however, have built-in support for things like differential data loads, so there is some hope there, however minor. While Microsoft claims the big benefit of OData is API standardization, the big win currently is that a lot of Microsoft products speak the protocol: Excel 2013 can natively import data from OData services (and there is a plugin for 2010), as can SharePoint. Notably, the demo applications for Excel use Access files with ~5-10 million row tables in them, which is well beyond what I would naively expect Excel to handle.

When you enable “PowerPivot”, you get a new ribbon:

[Screenshot: the PowerPivot ribbon in Excel]

When you click ‘Manage’ you get to another screen, where you can import data from other Excel files, Access databases, real databases, or a web service. From here you can also define foreign key relationships, so once you get a bunch of data downloaded, you can join across multiple sources.

While this sounds like an abomination, there are people who already run entire businesses this way. For instance, I ran across a grocery store that had a regular database for member information and inventory, plus separate Excel spreadsheets emailed to them by manufacturers – all of this was joined together in Excel to produce mailings, etc.

[Screenshot: the PowerPivot management window]

One other thing of note is that you can do imports from the ‘Azure Data Market’, which is a bunch of free and for-sale data sets, potentially a good distribution channel for someone with valuable data.

[Screenshot: importing from the Azure Data Market in PowerPivot]

If you’re strictly interested in developing OData API endpoints, Excel is still a nice tool, because you can trap the network traffic in Fiddler to see what the protocol looks like.

There are a few Java libraries to help support this protocol, although so far all seem a lot larger than one would really want (e.g. you end up pulling in a lot of dependencies, many of which will be incompatible with what you already use, e.g. CXF).

You can, however, get a really simple project up and running fairly quickly by following the Maven instructions (http://olingo.apache.org/doc/sample-setup.html) – Maven archetypes are just templates for projects, so these instructions aren’t really that clever:

mvn archetype:generate \
  -DinteractiveMode=false \
  -Dversion=1.0.0-SNAPSHOT \
  -DgroupId=com.sample \
  -DartifactId=my-car-service \
  -DarchetypeGroupId=org.apache.olingo \
  -DarchetypeArtifactId=olingo-odata2-sample-cars-service-archetype \
  -DarchetypeVersion=RELEASE
 
cd my-car-service
 
mvn clean install
 
cd target

You then need to copy “my-car-service.war” into Tomcat, and when you load http://localhost:8080/my-car-service/, you’ll get this page:

[Screenshot: the generated my-car-service welcome page]

To get Excel to read this, you need to use a slightly different URL:

http://localhost:8080/my-car-service/MyFormula.svc/

[Screenshot: importing the OData feed into Excel]

One of the neat things about this protocol is that it allows Excel to auto-discover the table names:

[Screenshot: Excel auto-discovering the service’s table names]

So far, it just seems to import all the data, so be careful:

[Screenshot: the import pulling down all rows]

Now you have the data, and can pull a refresh whenever you want it, and that’s it!

[Screenshot: the imported data in Excel]

How to fix MojoFailureException configuring Olingo

If you set up Olingo, they have you run a Maven archetype (project template) like so:

mvn archetype:generate \
  -DinteractiveMode=false \
  -Dversion=1.0.0-SNAPSHOT \
  -DgroupId=com.sample \
  -DartifactId=my-car-service \
  -DarchetypeGroupId=org.apache.olingo \
  -DarchetypeArtifactId=olingo-odata2-sample-cars-service-archetype \
  -DarchetypeVersion=RELEASE

If you didn’t follow the instructions correctly, you’ll get the following error:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-archetype-plugin:2.2:generate (default-cli) on project OlingoSampleClient: A Maven 2 project already exists in the directory d:\projects\odata\my-car-service -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging. 
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

All this means is that you need to make a ‘project’ folder and move everything into it – Maven doesn’t want an existing pom.xml in the folder where you run the archetype:

gary@GARY-PC /d/projects/odata
$ ls
my-car-service  pom.xml  project
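In other words, something like this (a sketch, assuming the layout from the listing above), after which you can re-run the archetype in the now-clean folder:

$ mkdir project
$ mv pom.xml project/
$ mvn archetype:generate ...   # the same archetype command as above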

“Postgres for Developers” – Notes from PGConf NYC 2014

I saw a talk by one of the core Postgres developers, which showed a bunch of interesting tricks for handling business rules in Postgres-specific SQL. These are all things you could find by reading the documentation, but they are interesting enough to be worth writing up with examples that highlight what you can do. A lot of these end up being useful for writing systems with immutable data (especially auditing, and sometimes reporting systems).

Example 1: Array Aggregation

“array_agg” can be used to combine rows, which sort of resembles a pivot table operation (the array holds the same set of values that would be passed as arguments to other aggregation functions):

SELECT y, array_agg(x) FROM (
  SELECT 1 x, 2 y
  UNION ALL
  SELECT 2 x, 2 y
  UNION ALL 
  SELECT 3 x, 3 y
) a
GROUP BY y
 
2;"{1,2}"
3;"{3}"

If you use the above table as a common table expression, you can also rename the columns in the WITH block. You can even join on the arrays:

WITH t(a, b) AS
(
  SELECT y, array_agg(x) FROM (
    SELECT 1 x, 2 y
    UNION ALL
    SELECT 2 x, 2 y
    UNION ALL 
    SELECT 3 x, 3 y
  ) a
  GROUP BY y
)
SELECT * 
FROM t t1 JOIN t t2 ON t1.b[2] = t2.a
 
2;"{1,2}";2;"{1,2}"

Example 2: Named Window Functions

I’m not sure yet whether this is just syntactic sugar or has real value, but you can set up named “windows.”

By way of explanation, a lot of times when you start using aggregate functions (min, max, array_agg, etc), you end up using window functions, which resemble the following:

SELECT a, MAX(b) OVER (partition BY a) 
FROM (
  SELECT 1 a, 1 b
  UNION ALL 
  SELECT 2 a, 1 b
  UNION ALL 
  SELECT 1 a, 2 b
) t1
 
1;2
1;2
2;1

These allow you to calculate aggregate functions (like min/max) without collapsing all the rows.

For instance, if you sort these values, you can find the “next” or “previous” row in the partition, which is pretty standard SQL stuff:

SELECT a, lag(b) OVER (partition BY a ORDER BY b) 
FROM (
  SELECT 1 a, 1 b
  UNION ALL 
  SELECT 2 a, 1 b
  UNION ALL 
  SELECT 1 a, 2 b
) t1
 
1;
1;1
2;

What’s cool is you can move the “over partition by” part out of the query to the end as a named window, which presumably would be really nice if you had a lot of them, or wanted to re-use the same window for multiple fields:

SELECT a, lag(b) OVER w
FROM (
  SELECT 1 a, 1 b
  UNION ALL 
  SELECT 2 a, 1 b
  UNION ALL 
  SELECT 1 a, 2 b
) t1
window w AS (partition BY a ORDER BY b) 
 
1;
1;1
2;

Example 3: Ranges

Postgres has a really cool feature, as of 9.2, where you can query whether something is in a range (ranges are a special type, kind of like the arrays above). This example is a bit contrived, to show that you could combine array_agg and range creation:

WITH _data AS (
  SELECT 1 a, 1 b
  UNION ALL 
  SELECT 2 a, 1 b
  UNION ALL 
  SELECT 1 a, 2 b
  UNION ALL
  SELECT 2 a, 2 b
),
_history AS (
  SELECT a, array_agg(b) _start, array_agg(b) _end
  FROM _data
  GROUP BY a
)
SELECT a, 
       _start[1], 
       _end[1], 
       int4range(_start[1]::INTEGER, _end[2]::INTEGER, '(]'::text) 
FROM _history
 
1;1;1;"[2,3)"
2;1;1;"[2,3)"

There are a bunch of range types built in (based on numerics, timestamps). Note that you can specify whether the endpoints on ranges are inclusive or exclusive.

You can find out if a data value is within a range with the @> operator and see if two ranges overlap with &&. This set of functionality is great for exploring audit records – if you make a range with “[valid_from, valid_to)” you can query to find out what rows were effective on a particular date/time, for instance.
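As a quick illustration (the audit table and its valid_from / valid_to columns here are hypothetical):

-- containment and overlap on ranges
SELECT int4range(1, 5) @> 3;                  -- true: 3 falls inside [1,5)
SELECT int4range(1, 5) && int4range(4, 10);   -- true: the two ranges overlap

-- which audit rows were effective at a given moment?
SELECT *
FROM account_history
WHERE tsrange(valid_from, valid_to, '[)') @> TIMESTAMP '2014-04-01 12:00:00';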

If you’re in this area of functionality, also check out btree_gist indexes, which may be helpful for tuning this.
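For example (a sketch only, reusing the hypothetical audit table from above), btree_gist lets you put an ordinary column and a range expression into a single GiST index, which can help queries that filter on both:

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE INDEX account_history_effective_idx
  ON account_history
  USING gist (account_id, tsrange(valid_from, valid_to, '[)'));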

Example 4: DISTINCT ON

Postgres has a feature to pull back the first row for each value in a grouping. I assume this is a performance feature, but at the least it’s a very concise syntax for something that would otherwise require the use of RANK(). I imagine that you’d always want to use an ORDER BY with it.

The example from the docs for this one is pretty clear:

SELECT DISTINCT ON (location) location, TIME, report
FROM weather_reports
ORDER BY location, TIME DESC;

There are a few other features that got a lot of play at the conference (e.g. foreign data wrappers) – more to come.

Postgres High Availability (Talk Notes, PGConf 2014)

I saw a talk by some operations engineers who work for ARIN at PGConf 2014.

They presented their architecture for high availability, which uses CMAN, Corosync, and Pacemaker, which handle message queueing, a quorum system and Postgres communication, respectively (Corosync is functioning similar to Zookeeper, as I understand it).

Their architecture was interesting for a couple of reasons: it uses a three-node system (master, synchronous slave, and asynchronous slave), and it uses the network switch to cut off bad nodes. Nodes are tasked with monitoring other nodes, and when one node can’t communicate with another, it sends a message to the others to request that they check as well. A different node is responsible for agreeing that some node is down, in which case it notifies the switch that it needs to “fence” the bad node (i.e. cut off all its network traffic).

In a failure scenario, the synchronous node can immediately become master, and the async must catch up and become synchronous – once the bad node is fixed, it re-enters the pool as asynchronous, which allows it to catch up slowly. It’s worth noting that if there is no synchronous node after a failure, the async node will be forced to catch up before the system can start running again, which will trigger a pause for end users.

Other ways to implement this (without a switch) would include iptables. Interestingly, the ARIN team came up with a list of ~45 failure scenarios, which they apparently tested in about three weeks (clearly this can be a lot of effort). Also interesting, a member of the audience mentioned an alternate design using dark fiber to connect SCSI connections on hard drives in what sounded like a RAID, for replication, but I don’t think most of my readers are going to be doing that.

PGConfNYC Keynote Notes (Gilt)

One of the keynote presentations for PGConf NYC was by a founder of Gilt Groupe (a site for flash sales of fashion items). This is an interesting business arena, in that the application’s load is inherently spiky.

The speaker observed that people tend to form small groups based on how many people we can keep track of (100-150), and that corporate org structures tend to influence software architecture. Consequently they introduced a concept within their corporate IT called “micro services”, where teams have the ability to provision resources (e.g. a Docker instance running Postgres plus some application) – the cluster of dependent services then resembles clusters of human relationships, where some powerful people are at the center, fanning out toward newer or more rural nodes.

The nice consequences of this approach are that it assumes a natural level of failure (so you have to plan for it), and forces everyone to build relatively independent services, a similar model to how AWS was formed. Assuming failure also forces you to build in things like alerting, which (apparently) makes auditors happy, while also forcing you to run your infrastructure correctly. It sounds like they also have a concept where they can grant someone a temporary privilege to view a production database with an expiring account, which is a neat idea.

Like the other keynote speaker, he had a few interesting ideas for areas to improve Postgres: append-only tables (in a similar vein to immutable data) and tables that store diffs (again with concepts around immutable data).

PGConfNYC: KeyNote notes (Goldman Sachs)

One of the the keynotes addresses to PGConf NYC was a talk by a technology manager at Goldman Sachs who had a bunch of interesting insights into the ecosystem. In their environment, Postgres is one of a number of supported platforms, so they have a good position for comparison (including to a nameless “commercial database”, the name of which can probably be easily inferred).

The most interesting insight to me is that for large companies, the marginal cost of installing a database is very low: most of the cost of Oracle is having it at all, because you typically aren’t provisioning hardware, etc. just to install a new application. On the other hand, even if you remove the licensing costs, the total cost of ownership of an open-source database is still quite significant, given that someone still needs to build tooling around the database (e.g. for replication, auditing, training, monitoring, and the like). It’s also useful for a large company to have someone to call one way or the other, as you can’t post anything that might give away secrets to public mailing lists.

Also of interest, there is a lot of talk at the conference around audit trails (and Auditors and Regulators) which dovetails nicely with the notion of immutable data. As this is a topic of interest in parts of the wider tech community, I’d expect a lot of developments in this area in the future.

He suggested a few new features that might be valuable: more programmatic configuration (vs. just editing files), a service name concept (similar to Oracle’s tnsnames, I assume) where you can address a database as a service without necessarily knowing where it is (similar to DNS), and one of my favorites, APIs that run alongside SQL (other ways to access the contents of the DB).

Overall I find it very insightful to see how different people are running their infrastructure, with varying scale and business constraints.