data integrity in sql - Gary Sieling

Tuning SQL queries is a useful skill, and while many people struggle to manage complex SQL, the work is actually a series of simple tricks.

For instance, refactoring a query often brings about algorithmic improvements, and if you tune enough queries, finding mathematically equivalent forms becomes muscle memory. This includes operations like inlining common table expressions, and knowing how to apply factoring (which you may need to try to remember from grade school).

When applying these operations, a few defects arise regularly. Consider the following:

select 'Gary' as customerName, '321 Street' as customerAddress
union all
select '111 Road' as customerAddress, 'Bob' as customerName

This is a valid query in Postgres – while it does type checking on each set in the union, it does not force you to name the columns the same.

This may look contrived, but it’s easier to make these happen in larger queries (e.g. long select lists)- the key point to understand is how easy it is to create subtle bugs.

Fortunately databases permit us to treat tables as sets, which allows you do do something like the following (note this assumes you have a unique key)

with a as (
  ... before ...
), b as (
  ... after ...
) 
select count(*) from a
union all
select count(*) from b
union all
select * from (
   select * from a 
   union
   select * from b
)

The “union” causes the inputs to be treated as sets, so they will be de-duplicated, whereas “union all” treats the inputs as a list, which forces the ordering.

For this query, you expect to see the count(*) to be the same from the “before”, “after” and combined sets.

To see the difference in rows, you can also do set differences:

with a as (
  ... before ...
), b as (
  ... after ...
) 
select * from a except select * from b
union 
select * from b except select * from a

And this will show you every row that is unique to either before or after – if this returns no results, you’ve left the query the same. If you have a lot of data, this can take some time to run, although it is typically far less than manual testing, but if you wish, you can add an equivalent where clause to the before and after queries.

Ideally you should test this on a copy of production data. There are still classes of problems you could miss – but at least you know no one will be able to find them (yet).

There is one more problem:

Lets say you swapped two columns, as in my original example. The set difference query will give you all the rows in each query. If you start with 100k rows, now you have 200k, and how are you supposed to figure that out?

It turns out to be pretty simple:

with a as (
  ... before ...
), b as (
  ... after ...
) 
select * from a except select * from b where id = 12345
union 
select * from b except select * from a where id = 12345

Find one that might be of interest, then apply your unique identifier. Now, you have two rows:

Then scroll across the result until you spot the difference:
pgadmin

This works well, whether you have dozens or hundreds of columns.

This is a very simple manual workflow, requiring no special tools- while I’ve used Postgres, it should work essentially the same in any relational database. If you don’t work in this style, it will save you tons of testing time, re-opened tickets, and remove the fear of changing areas of your product. As a developer, it can take you from the point where you struggle with “large” queries to where the size doesn’t matter.

Tag: data integrity in sql

Testing the output of tuned Postgres queries