wiki:DatabaseUuidEfficiency

Version 2 (modified by flip, 9 years ago) (diff)

--

Efficiency, UUIDs and SQLite

A UUID can be represented as a a 32 or 36 character string using printable ASCII, a 16 byte string (with each byte in the range 0x00 - 0xff, making the string non-printable) or a 16 byte integer.

We store our UUIDs in the most readable format which is also the fattest -- as a 36 character string like e8d1860e-0a09-41b89357-c3024e8394b2. (It's the same UUID with or without four dashes which is why they can be represented as 32 or 36 character strings.)

This document examines the pros and cons of storing UUIDs this way and examines alternatives.

If you like, you can jump straight to the conclusion.

Advantages and Disadvantages of the Current Scheme

The advantages of this scheme are simple. The UUID strings are recognizable to some as UUIDs (to some) and easy to read for everyone. They also match the strings we write to XML files.

The disadvantage is mainly that of space. It's my understanding that SQLite is generally very good at not wasting space. Nevertheless, a 36 character string is going to take at least 36 bytes to store.

However, for almost all of the tables in the current database, the difference hardly matters. (As of this writing, only Simulation's tables are stored in the main database.) Most tables that contain UUIDs will contain at most maybe 2000 rows (and often more like 200). 2000 * 36 = 72000, and 72000/1024 = 70 kilobytes of data. In other words, peanuts.

However, the simulations table is likely to be large (tens of thousands of rows or more) and it contains not one but two UUIDs. That means there's 72 UUID bytes per row. Assuming one million rows, that's 68 megabytes of UUIDs.

In terms of available disk space on the average computer, that's still peanuts. But in terms of the amount of data that SQLite needs to read from disk during a query, it's bad. The operating system reads from the disk in blocks (4k is a common size) and the more rows that can fit in a block, the more likely it is that a row will be cached in memory when SQLite requests it from disk. Bigger rows mean fewer rows per block and fewer rows cached in memory.

It would be nice to shrink our UUID storage. Is that possible?

Shrinking UUIDs - INTEGERs and REALs

As mentioned above, a 36 character UUID can also be represented as a 128 bit (16 byte) integer. Unfortunately, SQLite's INT type is 64 bit (8 bytes) and it offers no facility for storing larger ints.

SQLite has a REAL type which is a 64 bit/8 byte IEEE float. It's possible to convert a 16 byte integer into an 8 byte float, but the number doesn't survive the round trip due to floating point noise. Here's an example from a Python interpreter session:

>>> import uuid
>>> my_uuid = uuid.uuid4()
>>> my_uuid
UUID('379cdd33-f231-4122-b78d-7481557f13c1')
>>> my_uuid.int
73922024606215418702568719629565236161L
>>> # It's possible to recreate my_uuid from this int value
... 
>>> my_uuid == uuid.UUID(int=73922024606215418702568719629565236161L)
True
>>> # But I can't do so if the value passes through float() on the way
... 
>>> my_uuid == uuid.UUID(int=long(float(my_uuid.int)))
False
>>> # ...and we can see why...
... 
>>> uuid.UUID(int=long(float(my_uuid.int)))
UUID('379cdd33-f231-4200-0000-000000000000')

Shrinking UUIDs - raw bytes

So if we can't store UUIDs as ints or reals, how about 16 byte strings? This might be possible but it's not easy and starts to involve diminishing returns.

Inserting raw bytes into SQLite as strings isn't easy. In my experimentation, I found that byte strings that contained 0x00 (the C NULL terminator) got truncated on the way into SQLite. This is a real possibility when representing UUIDs as byte strings. See the Python interpreter session below for an example. Note that the 6th byte is 0x00. When inserted into SQLite, this will result in a string with length == 5.

>>> import uuid
>>> my_uuid = uuid.UUID("9a7c9069-9500-451e-85df-29c390636071")
>>> [hex(ord(c)) for c in my_uuid.bytes]
['0x9a', '0x7c', '0x90', '0x69', '0x95', '0x0', '0x45', '0x1e', '0x85', '0xdf', '0x29', '0xc3', '0x90', '0x63', '0x60', '0x71']

I'm sure that we could work around this with some hacking, but this gets into the diminishing returns I mentioned above. Possible, maybe, but too complicated.

Shrinking UUIDs - Double INTEGERs

A 128 bit integer can be represented as two 64 bit integers. This would require two columns to represent each UUID, however. That could possibly be made transparent on some level with clever use of VIEWs and possibly custom functions, but this is too complicated for us.

Shrinking UUIDs - Shadow ids

Remember that the problem isn't so much the UUIDs on the unique objects themselves, it's the table(s) that point to these objects. Instead of trying to shrink UUIDs, an alternative is to give each object a second, more compact unique identifier. A traditional AUTOINCREMENT INTEGER is an obvious choice. This id would be randomly assigned by the database and would have nothing to do with the UUID other than occupying the same row. In other words, it would shadow the UUID, following it around but always subordinate to it.

Integers are very compact in SQLite. They use 1 to 9 bytes; as little space as necessary is allocated to represent the number. Most of the ids we would use would fit into 2-4 bytes which is a lot smaller than 36.

Conclusion

Storing a more compact UUID representation in SQLite via Python might be possible, but it isn't easy. More to the point, it's not worth the trouble for us.

If the size of UUIDs becomes too painful at some point, the best solution is probably using shadow ids.

Attachments (1)

Download all attachments as: .zip