Changes between Initial Version and Version 1 of DatabaseUuidEfficiency


Ignore:
Timestamp:
Dec 15, 2010, 1:16:43 PM (9 years ago)
Author:
flip
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • DatabaseUuidEfficiency

    v1 v1  
     1= Efficiency, UUIDs and SQLite =
     2
     3A [http://en.wikipedia.org/wiki/Universally_unique_identifier UUID] can
     4be represented as a a 32 or 36 character string using printable ASCII,
     5a 16 byte string (with each byte in the range 0x00 - 0xff, making the
     6string non-printable) or a 16 byte integer.
     7
     8We store our UUIDs in the most readable format which is also the
     9fattest -- as a 36 character string like
     10`e8d1860e-0a09-41b89357-c3024e8394b2`. (It's the same UUID with or
     11without four dashes which is why they can be represented as 32 or
     1236 character strings.)
     13
     14This document examines the pros and cons of storing UUIDs this way
     15and examines alternatives.
     16
     17== Advantages and Disadvantages of the Current Scheme ==
     18
     19The advantages of this scheme are simple. The UUID
     20strings are recognizable to some as UUIDs (to some) and easy to read
     21for everyone. They also match the strings we write to XML files.
     22
     23The disadvantage is mainly that of space. It's my understanding that
     24SQLite is generally very good at not wasting space. Nevertheless,
     25a 36 character string is going to take at least 36 bytes to store.
     26
     27However, for almost all of the tables in the current database, the
     28difference hardly matters. (As of this writing, only Simulation's
     29tables are stored in the main database.) Most tables that contain
     30UUIDs will contain at most maybe 2000 rows (and often more like 200).
     312000 * 36 = 72000, and 72000/1024 = 70 kilobytes of data. In other
     32words, peanuts.
     33
     34However, the simulations table is likely to be large (tens of thousands of
     35rows or more) and it contains not one but two UUIDs. That means there's 72
     36UUID bytes per row. Assuming one million rows, that's 68 megabytes of UUIDs.
     37
     38In terms of available disk space on the average computer, that's still
     39peanuts. But in terms of the amount of data that SQLite needs to read from
     40disk during a query, it's bad. The operating system reads from the disk in
     41blocks (4k is a common size) and the more rows that can fit in a block, the
     42more likely it is that a row will be cached in memory when SQLite requests it
     43from disk. Bigger rows mean fewer rows per block and fewer rows cached in
     44memory.
     45
     46It would be nice to shrink our UUID storage. Is that possible?
     47
     48== Shrinking UUIDs - INTEGERs and REALs ==
     49
     50As mentioned above, a 36 character UUID can also be represented as a 128 bit
     51(16 byte) integer. Unfortunately, SQLite's INT type is 64 bit (8 bytes) and it
     52offers no facility for storing larger ints.
     53
     54SQLite has a REAL type which is a 64 bit/8 byte IEEE float. It's
     55possible to convert a 16 byte integer into an 8 byte float, but the
     56number doesn't survive the round trip due to floating point noise.
     57Here's an example from a Python interpreter session:
     58
     59{{{
     60#!python
     61>>> import uuid
     62>>> my_uuid = uuid.uuid4()
     63>>> my_uuid
     64UUID('379cdd33-f231-4122-b78d-7481557f13c1')
     65>>> my_uuid.int
     6673922024606215418702568719629565236161L
     67>>> # It's possible to recreate my_uuid from this int value
     68...
     69>>> my_uuid == uuid.UUID(int=73922024606215418702568719629565236161L)
     70True
     71>>> # But I can't do so if the value passes through float() on the way
     72...
     73>>> my_uuid == uuid.UUID(int=long(float(my_uuid.int)))
     74False
     75>>> # ...and we can see why...
     76...
     77>>> uuid.UUID(int=long(float(my_uuid.int)))
     78UUID('379cdd33-f231-4200-0000-000000000000')
     79}}}
     80
     81== Shrinking UUIDs - raw bytes ==
     82
     83So if we can't store UUIDs as ints or reals, how about 16 byte strings? This
     84might be possible but it's not easy and starts to involve diminishing returns.
     85
     86Inserting raw bytes into SQLite as strings isn't easy. In my experimentation,
     87I found that byte strings that contained 0x00 (the C NULL terminator) got
     88truncated on the way into SQLite. This is a real possibility when
     89representing UUIDs as byte strings. See the Python interpreter session
     90below for an example. Note that the 6th byte is 0x00. When inserted into
     91SQLite, this will result in a string with length == 5.
     92
     93{{{
     94#!python
     95>>> import uuid
     96>>> my_uuid = uuid.UUID("9a7c9069-9500-451e-85df-29c390636071")
     97>>> [hex(ord(c)) for c in my_uuid.bytes]
     98['0x9a', '0x7c', '0x90', '0x69', '0x95', '0x0', '0x45', '0x1e', '0x85', '0xdf', '0x29', '0xc3', '0x90', '0x63', '0x60', '0x71']
     99}}}
     100
     101I'm sure that we could work around this with some hacking, but this gets
     102into the diminishing returns I mentioned above. Possible, maybe, but
     103too complicated.
     104
     105== Shrinking UUIDs - Double INTEGERs ==
     106
     107A 128 bit integer can be represented as two 64 bit integers. This would
     108require two columns to represent each UUID, however. That could possibly be
     109made transparent on some level with clever use of VIEWs and possibly custom
     110functions, but this is too complicated for us.
     111
     112== Shrinking UUIDs - Shadow ids ==
     113
     114Remember that the problem isn't so much the UUIDs on the unique objects
     115themselves, it's the table(s) that point to these objects. Instead of trying
     116to shrink UUIDs, an alternative is to give each object a second, more compact
     117unique identifier. A traditional AUTOINCREMENT INTEGER is an obvious choice.
     118This id would be randomly assigned by the database and would have nothing to
     119do with the UUID other than occupying the same row. In other words, it would
     120shadow the UUID, following it around but always subordinate to it.
     121
     122Integers are very compact in SQLite. They use 1 to 9 bytes; as little
     123space as necessary is allocated to represent the number. Most of
     124the ids we would use would fit into 2-4 bytes which is a lot smaller
     125than 36.
     126
     127
     128== Conclusion ==
     129
     130Storing a more compact UUID representation in SQLite via Python might be
     131possible, but it isn't easy. More to the point, it's not worth the
     132trouble for us.
     133
     134If the size of UUIDs becomes too painful at some point, the best
     135solution is probably using shadow ids.
     136