views:

3357

answers:

3

I have been experimenting with using UUIDs as database keys. I want to take up the least amount of bytes as possible, while still keeping the UUID representation human readable.

I think that I have gotten it down to 22 bytes using base64 and removing some trailing "==" that seem to be unnecessary to store for my purposes. Are there any flaws with this approach?

Basically my test code does a bunch of conversions to get the UUID down to a 22 byte String, then converts it back into a UUID.

import java.io.IOException;
import java.util.UUID;

public class UUIDTest {

    public static void main(String[] args){
     UUID uuid = UUID.randomUUID();
     System.out.println("UUID String: " + uuid.toString());
     System.out.println("Number of Bytes: " + uuid.toString().getBytes().length);
     System.out.println();

     byte[] uuidArr = asByteArray(uuid);
     System.out.print("UUID Byte Array: ");
     for(byte b: uuidArr){
      System.out.print(b +" ");
     }
     System.out.println();
     System.out.println("Number of Bytes: " + uuidArr.length);
     System.out.println();


        try {
            // Convert a byte array to base64 string
            String s = new sun.misc.BASE64Encoder().encode(uuidArr);
            System.out.println("UUID Base64 String: " +s);
            System.out.println("Number of Bytes: " + s.getBytes().length);
            System.out.println();


            String trimmed = s.split("=")[0];
            System.out.println("UUID Base64 String Trimmed: " +trimmed);
            System.out.println("Number of Bytes: " + trimmed.getBytes().length);
            System.out.println();

            // Convert base64 string to a byte array
            byte[] backArr = new sun.misc.BASE64Decoder().decodeBuffer(trimmed);
      System.out.print("Back to UUID Byte Array: ");
      for(byte b: backArr){
       System.out.print(b +" ");
      }
      System.out.println();
      System.out.println("Number of Bytes: " + backArr.length);

      byte[] fixedArr = new byte[16];
      for(int i= 0; i<16; i++){
       fixedArr[i] = backArr[i];
      }
      System.out.println();
      System.out.print("Fixed UUID Byte Array: ");
      for(byte b: fixedArr){
       System.out.print(b +" ");
      }
      System.out.println();
      System.out.println("Number of Bytes: " + fixedArr.length);

      System.out.println();
      UUID newUUID = toUUID(fixedArr);
      System.out.println("UUID String: " + newUUID.toString());
      System.out.println("Number of Bytes: " + newUUID.toString().getBytes().length);
      System.out.println();

      System.out.println("Equal to Start UUID? "+newUUID.equals(uuid));
      if(!newUUID.equals(uuid)){
       System.exit(0);
      }


        } catch (IOException e) {
        }

    }


    public static byte[] asByteArray(UUID uuid) {

     long msb = uuid.getMostSignificantBits();
     long lsb = uuid.getLeastSignificantBits();
     byte[] buffer = new byte[16];

     for (int i = 0; i < 8; i++) {
      buffer[i] = (byte) (msb >>> 8 * (7 - i));
     }
     for (int i = 8; i < 16; i++) {
      buffer[i] = (byte) (lsb >>> 8 * (7 - i));
     }

     return buffer;

    }

    public static UUID toUUID(byte[] byteArray) {

     long msb = 0;
     long lsb = 0;
     for (int i = 0; i < 8; i++)
      msb = (msb << 8) | (byteArray[i] & 0xff);
     for (int i = 8; i < 16; i++)
      lsb = (lsb << 8) | (byteArray[i] & 0xff);
     UUID result = new UUID(msb, lsb);

     return result;
    }

}

output:

UUID String: cdaed56d-8712-414d-b346-01905d0026fe
Number of Bytes: 36

UUID Byte Array: -51 -82 -43 109 -121 18 65 77 -77 70 1 -112 93 0 38 -2 
Number of Bytes: 16

UUID Base64 String: za7VbYcSQU2zRgGQXQAm/g==
Number of Bytes: 24

UUID Base64 String Trimmed: za7VbYcSQU2zRgGQXQAm/g
Number of Bytes: 22

Back to UUID Byte Array: -51 -82 -43 109 -121 18 65 77 -77 70 1 -112 93 0 38 -2 0 38 
Number of Bytes: 18

Fixed UUID Byte Array: -51 -82 -43 109 -121 18 65 77 -77 70 1 -112 93 0 38 -2 
Number of Bytes: 16

UUID String: cdaed56d-8712-414d-b346-01905d0026fe
Number of Bytes: 36

Equal to Start UUID? true
+3  A: 

You don't say what DBMS you're using, but it seems that RAW would be the best approach if you're concerned about saving space. You just need to remember to convert for all queries, or you'll risk a huge performance drop.

But I have to ask: are bytes really that expensive where you live?

kdgregory
Yes, I think so... I want to save as much space as possible while still having it be human readable.
Grasper
OK, why do you think so? Are you storing a billion rows? You'll save 8 billion bytes, which isn't much. Actually, you'll save less, because your DBMS might reserve additional space for encoding. And if you go with VARCHAR instead of fixed-size CHAR, you're going to lose the space needed to save the actual length.
kdgregory
... and that "savings" is only if you use a CHAR(32). If you use RAW, you'll actually be saving space.
kdgregory
Any reasonable DBMS allows you to store UUIDs in native format, which requires 16 bytes. Any reasonable db tools will convert these to standard format (e.g. "cdaed56d-8712-414d-b346-01905d0026fe") in query results. People have been doing this for a long time. There's no need to re-invent the wheel.
Robert Lewis
+3  A: 

You can safely drop the padding "==" in this application. If you were to decode the base-64 text back to bytes, most libraries would expect it to be there, but since you are just using the resulting string as a key, it's not a problem.

I like Base-64 because its limited character set looks less like gibberish, but there's also Base-85. It uses more characters and codes 4 bytes as 5 characters, so you could get your text down to 20 characters.

erickson
A: 

I have an application where I'm doing almost exactly this. 22 char encoded UUID. It works fine. However, the main reason I'm doing it this way is that the IDs are exposed in the web app's URIs, and 36 characters is really quite big for something that appears in a URI. 22 characters is still kinda long, but we make do.

Here's the Ruby code for this:

  # Make an array of 64 URL-safe characters
  CHARS64 = ("a".."z").to_a + ("A".."Z").to_a + ("0".."9").to_a + ["-", "_"]
  # Return a 22 byte URL-safe string, encoded six bits at a time using 64 characters
  def to_s22
    integer = self.to_i # UUID as a raw integer
    rval = ""
    22.times do
      c = (integer & 0x3F)
      rval += CHARS64[c]
      integer = integer >> 6
    end
    return rval.reverse
  end

It's not exactly the same as base64 encoding because base64 uses characters that would have to be escaped if they appeared in a URI path component. The Java implementation is likely to be quite different since you're more likely to have an array of raw bytes instead of a really big integer.

Bob Aman