{"id":1468,"date":"2013-07-25T02:50:21","date_gmt":"2013-07-25T02:50:21","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=1468"},"modified":"2013-07-25T02:50:21","modified_gmt":"2013-07-25T02:50:21","slug":"generating-randomized-sample-data","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/generating-randomized-sample-data\/","title":{"rendered":"Generating Randomized Sample Data in Python"},"content":{"rendered":"<p>If you have access to a production data set, it is helpful to generate testing data which follows a similar format, in varying quantities. By introspecting a database, we can identify stated constraints. Given sufficient data volume, we can also infer implicit business process constraints. If preferred, we can also find records that may generate problems.<\/p>\n<p>As a thought experiment, I&#8217;ve considered how to acquire this information, as well as how to present it in rows using Python generators &#8211; to produce as little or much data as required.<\/p>\n<p>Consider a simple query, which can return datatype information from Oracle:<\/p>\n<pre lang=\"sql\">\nselect  \n   column_name,\n   data_type,\n   data_length,\n   data_precision,\n   nullable\nfrom all_tab_columns\nwhere schema = 'GARY'\nand table_name = 'TEST_TABLE'\n<\/pre>\n<p>For this example, I assume we want to generate one or two large tables &#8211; this exercise scales quickly when it expands to many tables. While it&#8217;s pretty easy to get the column sizes, it&#8217;s not that helpful &#8211; typically a UI will enforce additional constraints. If a value can hold a string, you typically also don&#8217;t want to generate random strings using every possible Unicode character, as you will get many atypical characters, and few of the ones most people use.<\/p>\n<p>You can improve this analysis by counting the number of distinct values for each column, and the distribution across each value. 
In many situations it works well to compute this for several columns, then pull random values from each distribution:<\/p>\n<pre lang=\"sql\">\nselect count(distinct a)\nfrom gary.test_table;\n\nselect a, count(*) from gary.test_table\ngroup by a\n<\/pre>\n<p>Since databases typically represent their internal information in tables, we can also query for constraint information. Again, we&#8217;re using Oracle, but a similar technique can be used in other systems. Oracle lists constraint types such as \u201cP\u201d, \u201cR\u201d, and \u201cU\u201d (primary key, referential, unique).<\/p>\n<pre lang=\"sql\">\nSELECT distinct cc.table_name,\n                c.constraint_type,\n                cc.column_name\nFROM all_constraints c\nJOIN all_cons_columns cc\nON (c.owner = cc.owner AND c.constraint_name = cc.constraint_name)\nWHERE cc.table_name = 'TEST_TABLE';\n<\/pre>\n<p>Similar to the above, you can look up foreign key constraints. This is particularly important, as it provides a second technique for finding the possible distinct values of a column:<\/p>\n<pre lang=\"sql\">\nselect c.table_name, cc.column_name\nfrom all_constraints c\njoin all_cons_columns cc\non (c.owner = cc.owner AND c.constraint_name = cc.constraint_name)\nwhere c.constraint_name in\n  (select r_constraint_name\n  from all_constraints\n  where constraint_type in ('R')\n  and table_name = 'TEST_TABLE');\n<\/pre>\n<p>Up to this point, the database has provided several possible lists of distinct values and known constraints. This is not enough for a complete system &#8211; to flesh the idea out fully, we&#8217;re going to have to make some inferences.<\/p>\n<p>Consider that each column&#8217;s values could form an open or closed set &#8211; a company&#8217;s list of vendors might be large but relatively fixed, whereas a numerical measurement or dollar figure is likely an open set. 
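<\/p>\n<p>One rough way to separate the two &#8211; a hypothetical heuristic, not a database feature &#8211; is the ratio of distinct values to total rows: a column whose distinct count is a tiny fraction of its row count probably draws from a fixed list:<\/p>\n

```python
def looks_closed(total_rows, distinct_values, ratio_threshold=0.01):
  # Heuristic: few distinct values relative to row count suggests a
  # closed set (e.g. a vendor list); a high ratio suggests an open set.
  return distinct_values / total_rows <= ratio_threshold

# looks_closed(1000000, 40) -> True (vendor-like column)
# looks_closed(1000000, 950000) -> False (measurement-like column)
```

<p>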
Primary keys are a special case of open sets &#8211; they are unique, and often sequential.<\/p>\n<p>We can again run some tests to sort columns into these categories. The following query gets several simple metrics:<\/p>\n<pre lang=\"sql\">\nselect count(a) c, count(distinct a) d, min(a) e, max(a) f\nfrom test_table\n<\/pre>\n<p>From this we can infer quite a bit: if c = d, every value is unique. If f &#8211; e + 1 = d, the values form a contiguous sequence, so we&#8217;re likely looking at an auto-increment column &#8211; in practice we can test whether f &#8211; e is merely close to d to achieve the same result, as records may have been deleted or skipped.<\/p>\n<p>We can also check whether two columns are tied closely to each other:<\/p>\n<pre lang=\"sql\">\nselect a, b, count(*) c from test_table\ngroup by a,b\n\nselect distinct d from (\n  select (case when x > y then 1 else 0 end) d\n  from test_table\n)\n<\/pre>\n<p>If c is always 1, there is a one-to-one mapping between the values, and if it is usually close to 1, we know the values are tightly tied. If the second query returns only 1, we know that x is always greater than y.<\/p>\n<p>If values for c vary widely, but the result set of the first query is small (e.g. a few hundred rows), we could use it to establish a probability distribution, choosing b randomly for a given a.<\/p>\n<p>There is one last class of data which requires special consideration: rare data. If you wish to generate a very small dataset with many special cases to improve testing, you may wish to do something like this:<\/p>\n<pre lang=\"sql\">\nselect t1.x from (\n  select x, count(*) c\n  from test_table\n  group by x\n) t1\ncross join (\n  select count(*) d\n  from test_table\n) t2\nwhere t1.c \/ t2.d <= 0.05\n<\/pre>\n<p>If we run the above analysis on many sets of columns, we can then establish a series of generator functions in Python, one per column. 
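<\/p>\n<p>The inference step that decides which generator a column gets can itself be sketched in code. The helper below is hypothetical &#8211; the names and tolerance are illustrative &#8211; classifying a numeric column from the four metrics (c, d, e, f) returned by the metrics query above:<\/p>\n

```python
def classify_column(c, d, e, f, tolerance=0.05):
  # c = count, d = distinct count, e = min, f = max (numeric columns only).
  if c == d:
    # A contiguous (or nearly contiguous) range of unique values
    # suggests an auto-increment key; allow slack for deleted rows.
    if abs((f - e + 1) - d) <= tolerance * d:
      return 'auto_increment'
    return 'unique'
  return 'repeating'
```

<p>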
We might, for instance, generate data for a three-column table like so:<\/p>\n<pre lang=\"python\">\ndef row_gen():\n  (c1, c2, c3) = (col1(), col2(), col3())\n  # iter(int, 1) is an infinite iterator: int() returns 0, which never equals 1\n  for x in iter(int, 1):\n    yield (next(c1), next(c2), next(c3))\n<\/pre>\n<p>Then, each column can be defined by a simple generator function, resembling the output of the above queries, for unique values and range lists:<\/p>\n<pre lang=\"python\">\ndef col1():\n  for w in [\"abcd\", \"efgh\", \"ijkl\"]:\n    yield w\n\ndef col2():\n  for w in range(100, 1000):\n    yield w\n<\/pre>\n<p>The above samples merely replay data from a saved dataset in order &#8211; in a real system, what we want is the ability to generate infinite data following some distribution, which the following function demonstrates.<\/p>\n<pre lang=\"python\">\nimport random\nfrom collections import Counter\n\ndef next_val(cumulative_prob, vals, i):\n  for idx in range(0, len(cumulative_prob)):\n    if i <= cumulative_prob[idx]:\n      return vals[idx]\n\ndef col3():\n  vals = ['a', 'b', 'c']\n  prob = [10, 5, 1]\n\n  c = 0\n  cumulative_prob = []\n  for p in prob:\n    c = c + p\n    cumulative_prob.append(c)\n\n  s = sum(prob)\n  for x in iter(int, 1):\n    i = random.randint(1, s)\n    yield next_val(cumulative_prob, vals, i)\n\nl = 0\ncnt = Counter()\nx = col3()\nwhile l < 10000:\n  cnt[next(x)] += 1\n  l = l + 1\n\ncnt\nOut[36]: Counter({'a': 6286, 'b': 3096, 'c': 618})\n<\/pre>\n<p>Thus, we can generate data. Depending on the data set in question, you will likely want to combine manual and automated analysis, and you will likely make discoveries along the way, especially where implicit dependencies and rare items are concerned (rare data values are often the cause of UI failures, and are occasionally indicative of error conditions accrued over time).<\/p>\n<p>In some situations, it may also be helpful to simulate statistical distributions &#8211; the above examples tend to assume that you are dealing with large sets of text columns, but this clearly will work poorly in 
other scenarios. This would likely require fitting several candidate probability distributions to the data, to see which matches best.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you have access to a production data set, it is helpful to generate testing data which follows a similar format, in varying quantities. By introspecting a database, we can identify stated constraints. Given sufficient data volume, we can also infer implicit business process constraints. If preferred, we can also find records that may generate &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/generating-randomized-sample-data\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Generating Randomized Sample Data in Python&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[29],"tags":[85,160,447,448,523,542],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1468"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=1468"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1468\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=1468"}],"wp:term":[{"taxonomy":"categ
ory","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=1468"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=1468"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}