{"id":2561,"date":"2015-09-02T02:06:03","date_gmt":"2015-09-02T02:06:03","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=2561"},"modified":"2015-09-02T02:06:03","modified_gmt":"2015-09-02T02:06:03","slug":"efficient-storage-of-large-ids-in-postgres","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/efficient-storage-of-large-ids-in-postgres\/","title":{"rendered":"Efficient storage of large IDs in Postgres"},"content":{"rendered":"<p>Imagine a system that uses large-ish hex string identifiers, e.g. similar to UUIDs. To make the math easy, lets say we make a table with a million have 20 character IDs:<\/p>\n<pre lang=\"sql\">\ncreate table ids1 (id character varying(20));\n\n-- takes 2091 ms\ninsert into ids1\nselect '0123456789ABCDEF0123'\nfrom generate_series(1, 1000000) num\n<\/pre>\n<p>We can get the size of the table pretty easily:<\/p>\n<pre lang=\"sql\">\nselect nspname || '.' || relname as \"relation\",\n    pg_size_pretty(pg_relation_size(C.oid)) AS \"size\"\nfrom pg_class C\nleft join pg_namespace on (N.oid = C.relnamespace)\nwhere nspname not in ('pg_catalog', 'information_schema') \n  and relname like 'ids%'\norder by pg_relation_size(C.oid) desc\n\npublic.ids1\t50 MB\n<\/pre>\n<p>So this isn&#8217;t a huge table, but if you have 20 tables with only a single ID, you have 1 GB, and pretty soon you&#8217;re at large sizes.<\/p>\n<p>We can inspect the type information, and see that we&#8217;re using 4 (!) bytes for each character, but since this is hex, each character is only worth a nibble (half byte), so we could be able to store this in ten bytes, rather than 96.<\/p>\n<pre lang=\"sql\">\nselect data_type, \n       character_maximum_length, \n       character_octet_length\nfrom information_schema.columns \nwhere table_name like 'ids%' \norder by column_name, table_name\n<\/pre>\n<p>character varying\t20\t80<\/p>\n<p>The character type looks the same, but lets try it, just to see:<\/p>\n<pre lang=\"sql\">\n-- 2300 ms to create \/ 50 MB disk\ncreate table ids2 (id character(20));\n<\/pre>\n<p>Postgres types can have both size and automatic checks on them. If you switch out to bytea, it does less checks, so it loads the table a lot faster, but it turns out it&#8217;s the same size:<\/p>\n<pre lang=\"sql\">\n2091 ms to create \/ 50 MB disk\ncreate table ids3 (id bytea);\n<\/pre>\n<p>There is also a UUID type. This is more like what we want &#8211; but it has to be 32 bytes. Hopefully the internals are more sane than using a string &#8211; when I built this table, we&#8217;ve now lost some time in construction, but also 8 MB:<\/p>\n<pre lang=\"sql\">\ncreate table ids4 (id UUID);\n\ninsert into ids4\nselect '0123456789ABCDEF0123000000000000'::UUID\nfrom generate_series(1, 1000000) num\n\n-- 2231 ms to create \/ 42 MB on disk\n<\/pre>\n<p>We can create a table that stores a bit array instead, which is interesting. Since we have 20 characters, and each is 4 characters, we need an 80 bit wide table:<\/p>\n<pre lang=\"sql\">\ncreate table ids5 (id bit(80));\n\n-- 2251 ms to create \/ 42 MB on disk\ninsert into ids5\nselect x'0123456789ABCDEF0123'::bit(80)\nfrom generate_series(1, 1000000) num\n<\/pre>\n<p>This is the same size as the UUID table but more accurately represents what we want (it would not surprise me if UUID used this internally).<\/p>\n<p>Just to convince ourselves that this is correct (and not a weird storage method like ASCII -> Int -> Bits), we can do the following:<\/p>\n<pre lang=\"sql\">\nselect x'A'::bit(4) -- 1010\nselect x'B'::bit(4) -- 1011\nselect x'C'::bit(4) -- 1100\n\nselect x'AB0C'::bit(16) -- 1010 1011 0000 1100\n<\/pre>\n<p>A lot of your ID disk space will end up being consumed by indexes, so it&#8217;s important to check that too. The effect on indexes is even more dramatic than on the base table:<\/p>\n<pre lang=\"sql\">\n-- 9621 ms \/ 39 MB\ncreate index idx1 on ids1(id); -- character varying\n\n-- 9772 ms \/ 39 MB\ncreate index idx2 on ids2(id); -- character(20)\n\n-- 1591 ms \/ 30 MB\ncreate index idx3 on ids3(id); -- bytea \n\n-- 2191 ms \/ 39 MB\ncreate index idx4 on ids4(id); -- uuid \n\n-- 3181 ms \/ 30 MB\ncreate index idx5 on ids5(id); -- bit(80)\n<\/pre>\n<p>I&#8217;m not sure why bytea loaded so fast &#8211; for this exercise I&#8217;m only interested in the total size, so bit(80) still wins.<\/p>\n<p>Depending on your situation, you may also be able to use integer columns. Be warned, however, that these cannot be cast to a bigint:<\/p>\n<pre lang=\"sql\">\nselect x'0123456789ABCDEF0123'::bigint\n\nERROR: bigint out of range\nSQL state: 22003\n<\/pre>\n<p>Nor a numeric:<\/p>\n<pre lang=\"sql\">\nselect x'0123456789ABCDEF0123'::numeric(20, 0)\nERROR:  cannot cast type bit to numeric\nSQL state: 42846\n<\/pre>\n<p>So, when using a system with excessively large data stores, you may be able to save some space by switching to bit strings. Integers data types can be feasible as well, but would be much easier if you start a system from scratch using unsigned integer IDs. They are otherwise quite small and preferred from the storage perspective, and popular in data warehousing situations (note that int types are usually multiples of some number of 2^n * 8 so you may have situations where bit() is still useful, like 24 string IDs).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Imagine a system that uses large-ish hex string identifiers, e.g. similar to UUIDs. To make the math easy, lets say we make a table with a million have 20 character IDs: create table ids1 (id character varying(20)); &#8212; takes 2091 ms insert into ids1 select &#8216;0123456789ABCDEF0123&#8217; from generate_series(1, 1000000) num We can get the size &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/efficient-storage-of-large-ids-in-postgres\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Efficient storage of large IDs in Postgres&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[4,7,22,25,26,29],"tags":[160,437],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/2561"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=2561"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/2561\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=2561"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=2561"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=2561"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}