views:

1522

answers:

6

Friends,

Here is the situation: An ASP.NET page has following query string parameter. MyServer.com/ShowSomething.aspx?IDs=1000000012,1000000021,1000000013,1000000022&...

Here IDs parameter will always have numbers separated by something, in this case ",". Currently there are 4 numbers but normally they would be in between 3-7.

Now, I am looking for method to convert each big number from above into smallest possible value; specifically compressing value of IDs query string parameter. Both, compressing each number algorithm or compressing whole value of IDs query string parameter are welcome.

1) Encode or decode is not an issue; just compressing the value IDs query string parameter. 2) Creating some unique small value for IDs and then retrieving its value from some data source is out of scope.

Please provide some algorithm to compress such big number to small value or algorithm to compress value of IDs query string parameter.

+13  A: 

You basically need so much room for your numbers because you are using base 10 to represent them. An improvement would be to use base 16 (hex). So for example, you could represent 255 (3 digits) as ff (2 digits).

You can take that concept further by using a much larger number base... the set of all characters that are valid query string parameters:

A-Z, a-z, 0-9, '.', '-', '~', '_', '+'

That gives you a base of 67 characters to work with (see Wikipedia on QueryString).

Have a look at this SO post for approaches to converting base 10 to arbitrary number bases.

EDIT:

In the linked SO post, look at this part:

string xx = IntToString(42, 
            new char[] { '0','1','2','3','4','5','6','7','8','9',
            'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z',
            'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x'});

That's almost what you need. Just expand it by adding the few characters it is missing:

yz.-~_+

That post is missing a method to go back to base 10. I'm not going to write it :-) but the procedure is like this:

Define a counter I'll call TOTAL.

Look at the right most character and find it's position in the array.
TOTAL = (the position of the character in the array) Example: Input is BA1. TOTAL is now 1 (since "1" is in position 1 in the array)

Now look at the next character left of the first one and find it's position in the array. TOTAL += 47 * (the position of the character in the array) Example: Input is BA1. TOTAL is now (47 * 11) + 1 = 518

Now look at the next character left of the previous one and find it's position in the array. TOTAL += 47 * 47 * (the position of the character in the array) Example: Input is BA1. Total is now (47 * 47 * 10) + (47 * 11) + 1 = 243508

And so on.

I suggest you write a unit test that converts a bunch of base 10 numbers into base 47 and then back again to make sure your conversion code works properly.

Note how you represented a 6 digit base 10 number in just 3 digits of base 47 :-)

Eric J.
Thanks Eric J.If I understand it, I should use higher base to convert it. If so, what number do you recommend to use as base?"... the set of all characters that are valid query string parameters:" Could you please explain it bit more?
Dave
<a href="en.wikipedia.org/wiki/Base64">Base64</a> is highly reccomended, and safer than base 67!
Oplopanax
@Dave: I recommend using Base 67, using the characters I list in the post. Those are the characters that are allowed to be used in a query string parameter without being URL encoded. Look at the link. It provides C# source code for going from base 10 to an arbitrary base. I'll edit my post to outline how to go back to base 10.
Eric J.
@EricThank you and will wait for your update. Also, if possible, please add details which could be concerns such as performance or best prac.
Dave
@Dave: Update in. The performance of this approach should be very good. The time to encode the numbers should be trivial compared to the time to call the web server across the internet.
Eric J.
@EricPerfect, it solves the problem.
Dave
@Oplopanax: Base64 (standard implementation) is not optimal because some characters that it uses will get url encoded, making the query string longer than necessary. Why is base 67 unsafe? Assuming Dave writes a unit test to ensure his conversion works, there's nothing unsafe about it as far as I can see.
Eric J.
I personally like base 62 (letters and numbers), but if you want to squeeze the most out, use the highest base... which supposedly is 67.
Mark
A: 

If the only issue is the URL length, you can convert numbers to base64 characters, then convert them back to numbers at the server side

Aziz
Base64 is not really optimal because the characters '+', '/' and '=' are all used, and they will be url encoded (making them much longer than necessary).
Eric J.
encoding strings to base64 encoding will make them larger not smaller (try it at http://www.opinionatedgeek.com/dotnet/tools/Base64Encode/Default.aspx). Base64 encoding is handy when you want to represent binary data in an ascii form, but doesn't offer any compression.
Darwyn
I didn't mean "convert string to base64" ... I was saying: "convert numbers to base64" .. i.e. convert the current decimal representation of the numbers to a base64 string, which should compress them. But I agree with Eric J, some characters should not be used.
Aziz
+3  A: 

What is the range of your numbers? Assuming they can fit in a 16-bit integer, I would:

  • Store all your numbers as 16-bit integers (2 bytes per number, range -32,768 to 32,767)
  • Build a bytestream of 16-bit integers (XDR might be a good option here; at very least, make sure to handle endianness correctly)
  • Base64 encode the bytestream, using the modified base64 encoding for URLs (net is about 3 characters per number)

As an added bonus you don't need comma characters anymore because you know each number is 2 bytes.

Alternatively, if that isn't good enough, I'd use zlib to compress your stream of integers and then base64 the zlib-compressed stream. You can also switch to 32-bit integers if 16-bit isn't a large enough range (i.e. if you really need numbers in the 1,000,000,000 range).

Edit:

Maybe too late, but here's an implementation that might do what you need:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Scratch {
    class Program {
        static void Main(string[] args) {
            //var ids = new[] { 1000000012, 1000000021, 1000000013, 1000000022 };
            var rand = new Random();
            var ids = new int[rand.Next(20)];
            for(var i = 0; i < ids.Length; i++) {
                ids[i] = rand.Next();
            }

            WriteIds(ids);
            var s = IdsToString(ids);
            Console.WriteLine("\nResult string is: {0}", s);
            var newIds = StringToIds(s);
            WriteIds(newIds);
            Console.ReadLine();
        }

        public static void WriteIds(ICollection<Int32> ids) {
            Console.Write("\nIDs: ");
            bool comma = false;
            foreach(var id in ids) {
                if(comma) {
                    Console.Write(",");
                } else {
                    comma = true;
                }
                Console.Write(id);
            }
            Console.WriteLine();
        }

        public static string IdsToString(ICollection<Int32> ids) {
            var allbytes = new List<byte>();
            foreach(var id in ids) {
                var bytes = BitConverter.GetBytes(id);
                allbytes.AddRange(bytes);                
            }
            var str = Convert.ToBase64String(allbytes.ToArray(), Base64FormattingOptions.None);
            return str.Replace('+', '-').Replace('/', '_').Replace('=', '.');
        }

        public static ICollection<Int32> StringToIds(string idstring) {
            var result = new List<Int32>();
            var str = idstring.Replace('-', '+').Replace('_', '/').Replace('.', '=');
            var bytes = Convert.FromBase64String(str);
            for(var i = 0; i < bytes.Length; i += 4) {
                var id = BitConverter.ToInt32(bytes, i);
                result.Add(id);
            }
            return result;
        }
    }
}
Daniel Pryden
Thanks Daniel,Its C# language and numbers could be like:1000000012,1000000021,1000000013,1000000022
Dave
87 chars to 44 charsThat's great Daniel. Thanks a lot.
Dave
Ohh... not able to mark this and first posts as answer.
Dave
@Dave: but you could upvote it :)
Ahmad Mageed
A: 

how patterned are the IDs you are getting? if digit by digit, the IDs are random, then the method I am about to propose won't be very efficient. but if the IDs you gave as an example are representative of the types you'd be getting, then perhaps the following could work?

i motivate this idea by example.

you have for example, 1000000012 as ID that you'd like to compress. why not store it as [{1},{0,7},{12}]? This would mean that the first digit is a 1 followed by 7 zeros followed by a 12. Thus if we use the notation {x} that would represent one instance of x, while if we use {x,y} that would mean that x occurs y times in a row.

you could extend this with a little bit of pattern matching and/or function fitting.

for example, pattern matching: 1000100032 would be [{1000,2}{32}].

for example, function fitting: if your IDs are 10 digits, then split the ID into two 5 digit numbers and store the equation of the line that goes through both points. if ID = 1000000012, the you have y1 = 10000 and y2 = 12. therefore, your slope is -9988 and your intercept is 10000 (assuming x1 = 0, x2 = 1). In this case, it's not an improvement, but if the numbers were more random, it could be. Equivalently, you could store the sequence of IDs with piecewise linear functions.

in any case, this mostly depends on the structure of your IDs.

B Rivera
Thanks Rivera. It is good idea actually.
Dave
A: 

I assume you are doing this as a workaround for request URL length restrictions ...

Other answers have suggested encoding the decimal id numbers in hex, base47 or base64, but you can (in theory) do a lot better than that by using LZW (or similar) to compress the id list. Depending on how much redundancy there is in your ID lists, you could get significantly more than 40% reduction, even after re-encoding the compressed bytes as text.

In a nut-shell, I suggest that you find an off-the-shelf text compression library implemented in Javascript and use it client side to compress the ID list. Then encode the compressed bytestring using base47/base64, and pass the encoded string as the URL parameter. On the server side do the reverse; i.e. decode followed by decompress.

EDIT: As an experiment, I created a list of 36 different identifiers like the ones you supplied and compressed it using gzip. The original file is 396 bytes, the compressed file is 101 bytes, and the compressed + base64 file 138 bytes. That is a 65% reduction overall. And the compression ratio could actually improve for larger files. However, when I tried this with a small input set (e.g. just the 4 original identifiers), I got no compression, and after encoding the size was larger than the original.

Google "lzw library javascript"

In theory, there might be simpler solution. Send the parameters as "post data" rather than in the request URL, and get the browser to apply the compression using one of the encodings that it understands. That will give you more savings too since there is no need to encode the compressed data into legal URL characters.

The problem is getting the browser to compress the request ... and doing that in a browser independent way.

Stephen C
+2  A: 

Here's another really simple scheme that should give good compression for a set of numbers of the form N + delta where N is a large constant.

public int[] compress(int[] input) {
    int[] res = input.clone();
    Arrays.sort(res);
    for (int i = 1; i < res.length; i++) {
        res[i] = res[i] - res[i - 1];
    }
    return res;
}

This should reduce the set {1000000012,1000000021,1000000013,1000000022} to the list [1000000012,1,9,1], which you can then compress further by representing the numbers in base47 encoding as described in another answer.

Using simple decimal encoding, this goes from 44 characters to 16 characters; i.e. 63%. (And using base47 will give even more compression).

If it is unacceptable to sort the ids, you don't get quite as good compression. For this example, {1000000012,1000000021,1000000013,1000000022} compresses to the list [1000000012,9,-8,9]. That is just one character longer for this example

Either way, this is better than a generic compression algorithm or encoding schemes ... FOR THIS KIND OF INPUT.

Stephen C
Neato. I like that it doesn't rely on a hardcoded `N`.
Mark
@Mark: ... and assuming that sorting is OK, it can cope with more than one value of N in the the set of numbers, though each new N adds a quantum of incompressibility.
Stephen C