views:

246

answers:

7

Can anyone suggest compression algorithms to operate on numeric strings of 20-30 digits ?

+2  A: 

Make it 2 15 digit numbers and convert them to 2 64 bit integers? Or are they floats?

extraneon
No they are not floats. Can you explain the answer a bit more? Thanks.
Vijay Dev
+7  A: 

You can easily compress 30 character string down to 15 bytes by just using binary representations of each digit. For example, 1592 can be represented as a series of four-bit values as such:

0001 0101 1001 0010

This, when grouped in groups of two four-bit values, can be represented as §Т in standard ASCII.

Further, if your strings contain many identical consecutive digits, you can implement a variation of Run-Length Encoding.

Anton Gogolev
Watch out if you're using C strings. The number 0 can be a problem since C strings are null-terminated.
luiscubal
This isn't as efficient as representing the entire string as a big integer. 4 bits can represent 16 different values, so this representation wastes 6/16 = 37.5% of the allocated space.
David Gladfelter
+1. Binary Coded Decimal (BCD), while not optimum, is a good first solution. It reduces storage cost by 50% and adds very little runtime overhead.
Jim Mischel
+1  A: 

One obvious solution is to "compress" them as a binary numeric representation rather than a string representation. See this stack overflow question for example libraries.

David Gladfelter
+2  A: 

Break it up into a couple of unsigned ints?

"9347692367596047327509604839"

becomes:

9 347692367 596047327 509604839

luke
The risk there is that if the left-most sub-sequence of the string is one or more zeroes, those get lost in your representation.
David Gladfelter
+1  A: 

I would definitely go for the easiest solution, and just store them as integers (of a suitable size, be it 32-bit, 64-bit or 128 bit, depending on needs). Compressing it with an algorithm supporting characters would waste a lot of space, since it would have to cater for a lot more than 10 different values (0-9) per character .

Erik A. Brandstadmoen
A nice idea, but 20 digits won't fit into a 64-bit integer. And although writing extended integer parsing and such isn't rocket science, it's far from trivial. In addition, such conversion would eliminate any leading zeros in the numeric string.
Jim Mischel
Leading zeros would have to be accounted for especially, of course, but it wasn't clear whether the string could contain and should preserve leading zeros.Observe, however, that I suggest a "suitable size", which would imply "something big enough to hold your data" :). The argument remains. Just use as many n-bit units you need to store the value.1) Store any leading zeros (if you wish to keep them)2) Read as many characters that you can store in a unit/word (32-bit,64-bit, whatever), and store them in a unit3) Repeat 2) until End of stream.4) Wrap the units/words in something readable
Erik A. Brandstadmoen
+1  A: 

Assuming you can have floating point numbers, you have a possibility of 11 symbols:

[0,1,2,3,4,5,6,7,8,9, .]

This means that you need 4 bits per symbol. 3 bits can only represent a maximum of 8 symbols. You can easily use 4 bits per each symbol and get a lot of compression.

If you only have integer digits in your string, an easy solution is to convert to hexidecimal and you can use 4 bits per symbol still while getting a better compression ratio. (since there are no wasted bits with 16 symbols)

If you use Huffman compression you will get an optimal bits/per symbol ratio. You can read more about Huffman compression here.

Brian R. Bondy
chose to go with this approach. Thanks Brian!
Vijay Dev
huffman compression is silly in this case because it is almost impossible to come up with a set of numbers that have non-near maximal entropy with respect to digits.
twolfe18
It's a better suggestion than using delta encoding, there is no proof that delta encoding will help him at all with the info the OP gave.
Brian R. Bondy
+1  A: 

one of the most common ways to compress numbers (assuming you have more than one you want to compress -- its kind of hard to compress one thing), is using delta encoding. It works on the principle that if you know the first number is x, and the numbers after it are relatively similar, you can encode the subsequent numbers as (x+c1), (x+c2), etc.

In this scheme, you only have to encode the full x value once, and if your c values are smaller than your x's, then you can save a lot of space. You can also use a version of this that sorts the numbers first, and then your delta refers to the number last seen instead of one number. With this method you can cover a wider range of numbers more efficiently.

twolfe18