data sample - Gary Sieling

I’ve built several reporting systems where work was divided evenly between a charting UI and database scripts – an ETL job, report sql, and database schema. It’s nice to divide work between UI and database developers to take advantage of specialization, but not having data is always an issue for the first week or two of the project.

Several architectural options present themselves for splitting off work. One can build a service that returns JSON- static at first and live data later. This works well for a highly specialized team. Another option is to populate the database with sample data. If there is an ETL script that moves data between databases, I like to populate the source database to simulate performance issues and exercise rare edge cases.

I’ve built out data with formulas in Excel spreadsheets – this is really easy to write and generates a data quickly, but with low variability. Formulas allow for some constraints. A typical spreadsheet might dictate that column A is a random string, column B is a random number from 1 to 10, column C is a random value from column A. The value of these files decays very quickly – as soon as a database schema changes they must be rebuilt. This method works well for a small team, where the database can built built quickly.

A second approach is to read the DB schema from dba_tables (Oracle)/pg_tables(Postgres) and generate data that satisfies various constraints. This functions longer than the Excel tables, but I find continually having to add constraints to Java code to be an onerous task.

I’ve long thought that a deductive logic-based programming language would allow for more pleasant constraint specification. Prolog is built around this linguistic concept. Rather than dictating instructions, like many languages, you list a database of facts and relationships. You provide the language with a goal which it solves through deductive reasoning. This is similar to how build systems like ant resolve build targets, except Prolog tolerates ambiguity- if multiple solutions to a goal exist, Prolog will happily tell you all of them, which is exactly the case you want when generating test data.

I finally had a chance to try this for a project. The following script will generate all subsets of a list of notes:

#!/usr/bin/swipl -q -f

subset([], []).
subset([E|Tail], [E|NTail]):-subset(Tail, NTail).
subset([_|Tail], NTail):- subset(Tail, NTail).

main :- 
	set_prolog_flag(toplevel_print_options,
                [quoted(true)]),
	subset(['A', 'A#', 'B', 'C', 'C#', 'D', 
                'D#', 'E', 'F', 'F#', 'G', 'G#'], X),
	write(X),
        nl,fail;true.

run :- main, halt.

To run this script, run this command line:

swipl -s test.pl -t run > output.txt

This generates output like the following:

[A,A#,B,C,C#,D,D#,E,F,F#,G,G#]
[A,A#,B,C,C#,D,D#,E,F,F#,G]
[A,A#,B,C,C#,D,D#,E,F,F#,G#]
[A,A#,B,C,C#,D,D#,E,F,F#]
[A,A#,B,C,C#,D,D#,E,F,G,G#]

The real power comes when you add constraints, like so:


#!/usr/bin/swipl -q -f

subset([], []).
subset([E|Tail], [E|NTail]):-subset(Tail, NTail).
subset([_|Tail], NTail):- subset(Tail, NTail).

size([],0).
size([_|T],N):-
    size(T,M),
    N is M+1.

validate(X) :- size(X, Y), Y < 5 -> write(X).

main :- set_prolog_flag(toplevel_print_options,
                [quoted(true)]),
	subset(['A', 'A#', 'B', 'C', 'C#', 'D', 
                'D#', 'E', 'F', 'F#', 'G', 'G#'], X),
	validate(X),
	nl,
	fail,
	true.

run :- main, halt.

This forces the script to return all subsets of the list with five notes or less. While the syntax feels esoteric, it is very concise, and may prove to be a good option for building out sample data in the future.

Tag: data sample

Using Prolog to Generate Test Data