views:

99

answers:

6

What's the best way to compress or encode a list of numbers of arbitrary length and sizes into a single alphanumeric string?

The goal is to be able to convert something like 1,5,8,3,20,212,42 into something like a8D1jN to be used in a URL, and then back to 1,5,8,3,20,212,42.

For the resulting string I'm fine with any number and any ascii character, lowercase and uppercase, so: 0-9a-zA-Z. I prefer not to have any punctuation whatsoever.

UPDATE: added note about which characters are fine.

+1  A: 

Depends on the range of the numbers -- with a reasonable range a simple dictionary compression scheme could work.

Given your edit and estimate of 10k rows, a dictionary scheme where each number is mapped to a triple of [A-Za-z0-9] could be unique for 62*62*62 different entries.

dougvk
The numbers are ids in the database; I'm not sure if it would ever constitute a reasonable range (I don't expect it to grow a lot, definitely not above 10k), I'll look into dictionary compression. Thanks.
J. Pablo Fernández
+5  A: 

You can use an encoding scheme like the Base64.

Base64 modules or libraries are common in multiple programming languages.

Soravux
Base64 will convert to both uppercase and lowercase - that may not be what the OP wants. Base36 is closer, but lacks inbuilt implementations in most frameworks.
Michael Petrotta
I'm fine with uppercase and lowercase. What I'm not sure about is whether I should just use a coma as separator; seems like a waste if you think in binary, but I might be wrong.
J. Pablo Fernández
The comma would be needed if the numbers are not all the same size. If they are then you would be using extra space to represent the smaller numbers. For instance '1' takes one byte of space as a character, but 4 bytes as a number.
Patrick
+1  A: 

There might be a super cool and efficient algorithm for your case. However, a very simple, tested, and reliable algorithm would be to use a "common" encoding or compression algorithm on the comma seperated string of numbers.

There are many available to choose from.

Patrick
What do you mean by a common encoding or compression algorithm?
J. Pablo Fernández
The ones that come to mind right now are BZip, GZip, and Deflate.
Patrick
A: 

'best' depends on what your criteria are.

If best means simple : just string the numbers together seperated by a fixed character:

1a5a8a3a20a212a42

This should also be fast

If you want the resulting string to be small, you can run the string above through some compression algorithm like zip, then the result through some encoding like base64 or similar.

Jens Schauder
+1  A: 

If you consider your list as a string, then you have 11 different characters to encode (0-9 and comma). This can be expressed in 4 bits. If you were willing to add, say $ and ! to your list of acceptable characters, then you would have 64 different output characters, and thus be able to encode 6 bits per character.

This would mean that you could map the string to an encoded string that would be about 30% shorter than the original one, and fairly obfuscated and random looking.

This way you could transcode the number series [1,5,8,3,20,212,42] to the string "gLQfoIcIeQqq".

UPDATE: I felt inspired and wrote a python solution for this solution (not fast but functional enough...)

ZERO = ord('0')
OUTPUT_CHARACTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789$!"

def encode(numberlist):

    # convert to string -> '1,5,8,3,20,212,42'
    s = str(numberlist).replace(' ','')[1:-1]

    # convert to four bit values -> ['0010', '1011', '0110', ... ]
    # (add 1 to avoid the '0000' series used for padding later)
    four_bit_ints = [0 <= (ord(ch) - ZERO) <= 9 and (ord(ch) - ZERO) + 1 or 11 for ch in s]
    four_bits = [bin(x).lstrip('-0b').zfill(4) for x in four_bit_ints]

    # make binary string and pad with 0 to align to 6 -> '00101011011010111001101101...'
    bin_str = "".join(four_bits)
    bin_str = bin_str + '0' * (6 - len(bin_str) % 6)

    # split to 6bit blocks and map those to ints
    six_bits = [bin_str[x * 6 : x * 6 + 6] for x in range(0, len(bin_str) / 6)]
    six_bit_ints = [int(x, 2) for x in six_bits]

    # map the 6bit integers to characters
    output = "".join([OUTPUT_CHARACTERS[x] for x in six_bit_ints])

    return output

def decode(input_str):

    # map the input string from characters to 6bit integers, and convert those to bitstrings
    six_bit_ints = [OUTPUT_CHARACTERS.index(x) for x in input_str]
    six_bits = [bin(x).lstrip('-0b').zfill(6) for x in six_bit_ints]

    # join to a single binarystring
    bin_str = "".join(six_bits)

    # split to four bits groups, and convert those to integers
    four_bits = [bin_str[x * 4 : x * 4 + 4] for x in range(0, len(bin_str) / 4)]
    four_bit_ints = [int(x, 2) for x in four_bits]

    # filter out 0 values (padding)
    four_bit_ints = [x for x in four_bit_ints if x > 0]

    # convert back to the original characters -> '1',',','5',',','8',',','3',',','2','0',',','2','1','2',',','4','2'
    chars = [x < 11 and str(x - 1) or ',' for x in four_bit_ints]

    # join, split on ',' convert to int
    output = [int(x) for x in "".join(chars).split(',') if x]

    return output


if __name__ == "__main__":

    # test
    for i in range(100):
        numbers = range(i)
        out = decode(encode(numbers))
        assert out == numbers

    # test with original series
    numbers = [1,5,8,3,20,212,42]
    encoded = encode(numbers)
    print encoded         # prints 'k2UBsZgZi7uW'
    print decode(encoded) # prints [1, 5, 8, 3, 20, 212, 42]
AHM
+1  A: 

Instead of comma separating the numbers, you could do a simple encoding where you replace the last digit of each number with 'a'+digit. So, your list [1,5,8,3,20,212,42] would become mysterious looking bfid2a21c4c. :)

I would use something like this only if there are a handful of numbers, where compression won't be able to shorten the string a lot. If it s a lot of numbers we are talking about, you could try to perform some sort of compression + base64 encoding on the data instead.

Dysaster