Given that an n-byte array can be represented as a 2*n-character string using hex, is there a way to represent the n-byte array in fewer than 2*n characters?

For example, an integer (int32) can typically be treated as a 4-byte array of data.

+4  A: 

Yes, using binary (in which case it takes n bytes, not surprisingly), or using any base higher than 16. A common one is base 64.

tobyodavies
A: 

Yes. Use more characters than just 0-9 and a-f. A single character (assuming 8-bit) can have 256 values, so you can represent an n-byte number in n characters.

If it needs to be printable, you can just choose some set of characters to represent various values. A good option is base-64 in that case.

JoshD
+1  A: 

How about base-64?

It all depends on what characters you're willing to use in your encoding (i.e. representation).

Assaf Lavie
+1  A: 

Base64 fits 6 bits in each character, which means that 3 bytes will fit in 4 characters.
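As a quick check of that ratio, here is a minimal sketch using Python's standard `base64` module. The helper name `encoded_lengths` is made up for this illustration.

```python
import base64

def encoded_lengths(data: bytes) -> tuple[int, int]:
    """Return (hex length, padded Base64 length) for the given bytes."""
    return len(data.hex()), len(base64.b64encode(data))

print(encoded_lengths(bytes(3)))   # 3 bytes
print(encoded_lengths(bytes(4)))   # 4 bytes, e.g. an int32
print(encoded_lengths(bytes(30)))  # 30 bytes
```

Note that standard Base64 pads its output to a multiple of 4 characters, so a lone 4-byte value comes out at 8 characters, the same as hex. The saving shows up for longer inputs, or if the trailing "=" padding is stripped.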

Ignacio Vazquez-Abrams
+2  A: 

It might depend on the exact numbers you want to represent. For instance, the number 9223372036854775808, which requires 8 bytes to represent in binary, takes only 4 bytes in ASCII if you use its product-of-primes representation (which is "2^63").

TokenMacGuy
+6  A: 

The advantage of hex is that splitting an 8-bit byte into two equal halves is about the simplest thing you can do to map a byte to printable ASCII characters. More efficient methods consider multiple bytes as a block:


Base-64 uses 64 ASCII characters to represent 6 bits at a time. Every 3 bytes (i.e. 24 bits) are split into four 6-bit base-64 digits, where the "digits" are:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

(and if the input is not a multiple of 3 bytes long, a 65th character, "=", is used for padding at the end). Note that some variant forms of base-64 use different characters for the last two "digits".
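The 3-bytes-to-4-digits split can be sketched by hand in Python. This is an illustrative version only (the names `ALPHABET` and `b64_encode` are made up here); in practice the standard `base64.b64encode` does the same job.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64_encode(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)  # 0, 1 or 2 missing bytes in the last chunk
        # Treat the (zero-filled) chunk as one 24-bit big-endian integer.
        block = int.from_bytes(chunk + b"\x00" * pad, "big")
        # Split it into four 6-bit digits, most significant first.
        digits = [ALPHABET[(block >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Digits that carry no input bits are replaced by "=".
        if pad:
            digits[4 - pad:] = "=" * pad
        out.extend(digits)
    return "".join(out)

print(b64_encode(b"Man"))  # one full 3-byte block
print(b64_encode(b"Ma"))   # one padding character
print(b64_encode(b"M"))    # two padding characters
```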


Ascii85 is another representation, which is somewhat less well-known, but commonly used: it's often the way that binary data is encoded within PostScript and PDF files. This considers every 4 bytes (big-endian) as an unsigned integer, which is represented as a 5-digit number in base 85, with each base-85 digit encoded as ASCII code 33+n (i.e. "!" for 0, up to "u" for 84) - plus a special case where the single character "z" may be used (instead of "!!!!!") to represent 4 zero bytes.

(Why 85? Because 84^5 < 2^32 < 85^5.)
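A rough Python sketch of the per-group step described above, handling only complete 4-byte groups (a short final group needs extra handling that is not shown). The function name `a85_group` is made up for this example; Python's standard library offers `base64.a85encode` for real use.

```python
import base64

# Five base-85 digits are enough to cover 32 bits; five base-84 digits are not.
assert 84**5 < 2**32 < 85**5

def a85_group(four_bytes: bytes) -> str:
    """Encode exactly 4 bytes as 5 Ascii85 characters, or "z" for 4 zero bytes."""
    if len(four_bytes) != 4:
        raise ValueError("expected exactly 4 bytes")
    value = int.from_bytes(four_bytes, "big")  # big-endian unsigned integer
    if value == 0:
        return "z"  # special-case shortcut instead of "!!!!!"
    digits = []
    for _ in range(5):
        value, d = divmod(value, 85)  # peel off the least significant digit
        digits.append(chr(33 + d))    # digit n maps to ASCII code 33+n
    return "".join(reversed(digits))  # most significant digit first

group = b"Man "
print(a85_group(group))
print(a85_group(group) == base64.a85encode(group).decode("ascii"))
```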

Matthew Slattery
+1 for a great explanation, I've never heard of Base85 before.
GWW
A: 

Using 65536 of the roughly 90000 defined Unicode characters, you can represent a binary string of N bytes in N/2 characters.

Vovanium