Imagine a system that uses large-ish hex string identifiers, e.g. similar to UUIDs. To make the math easy, lets say we make a table with a million have 20 character IDs:
create table ids1 (id character varying(20)); -- takes 2091 ms insert into ids1 select '0123456789ABCDEF0123' from generate_series(1, 1000000) num
We can get the size of the table pretty easily:
select nspname || '.' || relname as "relation", pg_size_pretty(pg_relation_size(C.oid)) AS "size" from pg_class C left join pg_namespace on (N.oid = C.relnamespace) where nspname not in ('pg_catalog', 'information_schema') and relname like 'ids%' order by pg_relation_size(C.oid) desc public.ids1 50 MB
So this isn’t a huge table, but if you have 20 tables with only a single ID, you have 1 GB, and pretty soon you’re at large sizes.
We can inspect the type information, and see that we’re using 4 (!) bytes for each character, but since this is hex, each character is only worth a nibble (half byte), so we could be able to store this in ten bytes, rather than 96.
select data_type, character_maximum_length, character_octet_length from information_schema.columns where table_name like 'ids%' order by column_name, table_name
character varying 20 80
The character type looks the same, but lets try it, just to see:
-- 2300 ms to create / 50 MB disk create table ids2 (id character(20));
Postgres types can have both size and automatic checks on them. If you switch out to bytea, it does less checks, so it loads the table a lot faster, but it turns out it’s the same size:
2091 ms to create / 50 MB disk create table ids3 (id bytea);
There is also a UUID type. This is more like what we want – but it has to be 32 bytes. Hopefully the internals are more sane than using a string – when I built this table, we’ve now lost some time in construction, but also 8 MB:
create table ids4 (id UUID); insert into ids4 select '0123456789ABCDEF0123000000000000'::UUID from generate_series(1, 1000000) num -- 2231 ms to create / 42 MB on disk
We can create a table that stores a bit array instead, which is interesting. Since we have 20 characters, and each is 4 characters, we need an 80 bit wide table:
create table ids5 (id bit(80)); -- 2251 ms to create / 42 MB on disk insert into ids5 select x'0123456789ABCDEF0123'::bit(80) from generate_series(1, 1000000) num
This is the same size as the UUID table but more accurately represents what we want (it would not surprise me if UUID used this internally).
Just to convince ourselves that this is correct (and not a weird storage method like ASCII -> Int -> Bits), we can do the following:
select x'A'::bit(4) -- 1010 select x'B'::bit(4) -- 1011 select x'C'::bit(4) -- 1100 select x'AB0C'::bit(16) -- 1010 1011 0000 1100
A lot of your ID disk space will end up being consumed by indexes, so it’s important to check that too. The effect on indexes is even more dramatic than on the base table:
-- 9621 ms / 39 MB create index idx1 on ids1(id); -- character varying -- 9772 ms / 39 MB create index idx2 on ids2(id); -- character(20) -- 1591 ms / 30 MB create index idx3 on ids3(id); -- bytea -- 2191 ms / 39 MB create index idx4 on ids4(id); -- uuid -- 3181 ms / 30 MB create index idx5 on ids5(id); -- bit(80)
I’m not sure why bytea loaded so fast – for this exercise I’m only interested in the total size, so bit(80) still wins.
Depending on your situation, you may also be able to use integer columns. Be warned, however, that these cannot be cast to a bigint:
select x'0123456789ABCDEF0123'::bigint ERROR: bigint out of range SQL state: 22003
Nor a numeric:
select x'0123456789ABCDEF0123'::numeric(20, 0) ERROR: cannot cast type bit to numeric SQL state: 42846
So, when using a system with excessively large data stores, you may be able to save some space by switching to bit strings. Integers data types can be feasible as well, but would be much easier if you start a system from scratch using unsigned integer IDs. They are otherwise quite small and preferred from the storage perspective, and popular in data warehousing situations (note that int types are usually multiples of some number of 2^n * 8 so you may have situations where bit() is still useful, like 24 string IDs).
3 Replies to “Efficient storage of large IDs in Postgres”
in second script
WHERE nspname NOT (‘pg_catalog’, ‘information_schema’)
missed IN after NOT
How about the speed of bytea plus the space of your bit implementation? Convert your text hex to bytea decode(). I’d be curious to see your comparison numbers for this.
CREATE TABLE ids6 (id bytea);
INSERT INTO ids6
SELECT decode(‘0123456789ABCDEF0123’, ‘hex’)
FROM generate_series(1, 1000000) num;
CREATE INDEX idx6 ON ids6(id);