I have many bunches of binary data, ranging from 16 to 4096 bytes, which need to be stored in a database and which should be easily comparable as a unit (e.g. two bunches of data match only if the lengths match and all bytes match). Strings are nice for that, but converting binary data blindly to a string is apt to cause problems due to character encoding/reinterpretation issues.

Base64 was a common method for storing binary data as strings in an era when 7-bit ASCII was the norm; its 33% space penalty was a little annoying, but not horrible. Unfortunately, if one is using UTF-16, the space penalty is about 167% (8 bytes to store 3), which seems pretty icky.

Is there any common method for storing binary data in a valid Unicode string which will allow better efficiency in UTF-16 (and hopefully not be too horrible in UTF-8)? A base-32768 coding (15 bits per character) would store 240 bits in sixteen characters, which would take 32 bytes of UTF-16 or, at three bytes per character, 48 bytes of UTF-8. By comparison, base64 coding would use 40 characters, which would take 80 bytes of UTF-16 or 40 bytes of UTF-8. An approach designed to take the same space in UTF-8 or UTF-16 might store 48 bits in three characters that would take eight bytes in either encoding, thus storing 240 bits in 40 bytes of either UTF-8 or UTF-16.
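To make the arithmetic above easy to check, here is a rough Python sketch of the size comparison. Only the Base64 figures are measured from a real encoding; `base32768_sizes` and `balanced_sizes` are hypothetical helpers that just count bytes for the two imagined codings, assuming every base-32768 character is a BMP code point at or above U+0800 (3 bytes in UTF-8, 2 bytes in UTF-16) and taking the 8-bytes-per-three-characters figure as given.

```python
import base64


def base64_sizes(payload: bytes) -> dict:
    """Measure an actual Base64 encoding of `payload`."""
    text = base64.b64encode(payload).decode("ascii")
    return {
        "chars": len(text),
        "utf8": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte-order mark is counted.
        "utf16": len(text.encode("utf-16-le")),
    }


def base32768_sizes(n_bytes: int) -> dict:
    """Estimate sizes for a hypothetical 15-bits-per-character coding.

    Assumption: each character is a BMP code point >= U+0800,
    costing 3 bytes in UTF-8 and 2 bytes in UTF-16.
    """
    chars = (n_bytes * 8 + 14) // 15  # ceiling of bits / 15
    return {"chars": chars, "utf8": chars * 3, "utf16": chars * 2}


def balanced_sizes(n_bytes: int) -> dict:
    """Estimate sizes for the hypothetical 48-bits-per-3-characters coding.

    Assumption: each group of three characters occupies 8 bytes
    in both UTF-8 and UTF-16, as supposed in the question.
    """
    groups = (n_bytes * 8 + 47) // 48  # ceiling of bits / 48
    return {"chars": groups * 3, "utf8": groups * 8, "utf16": groups * 8}


if __name__ == "__main__":
    payload = bytes(30)  # 30 bytes = 240 bits
    print("base64    ", base64_sizes(payload))
    print("base32768 ", base32768_sizes(len(payload)))
    print("balanced  ", balanced_sizes(len(payload)))
```

For a 30-byte payload this reproduces the numbers above: 40/80 bytes for Base64, 48/32 for base-32768, and 40/40 for the balanced scheme (UTF-8/UTF-16 respectively).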

Are there any standards for anything like that?