views:

1183

answers:

12

Here is the scenario.

I am given an array 'A' of integers. The size of the array is not fixed. The function that I am supposed to write may be called once with an array of just a few integers while another time, it might even contain thousands of integers. Additionally, each integer need not contain the same number of digits.

I am supposed to 'sort' the numbers in the array such that the resulting array has the integers ordered in a lexicographic manner (i.e they are sorted based on their string representations. Here "123" is the string representation of 123). Please note that the output should contain integers only, not their string equivalents.

For example: if the input is:

[ 12 | 2434 | 23 | 1 | 654 | 222 | 56 | 100000 ]

Then the output should be:

[ 1 | 100000 | 12 | 222 | 23 | 2434 | 56 | 654 ]

My initial approach: I converted each integer to its string format, then added zeros to its right to make all the integers contain the same number of digits (this was the messy step as it involved tracking etc making the solution very inefficient) and then did radix sort. Finally, I removed the padded zeros, converted the strings back to their integers and put them in the resulting array. This was a very inefficient solution.

I've been led to believe that the solution doesn't need padding etc and there is a simple solution where you just have to process the numbers in some way (some bit processing?) to get the result.

What is the space-wise most efficient solution you can think of? Time-wise?

If you are giving code, I'd prefer Java or pseudo-code. But if that doesn't suit you, any such language should be fine.

+3  A: 

Executable pseudo-code (aka Python): thenumbers.sort(key=str). Yeah, I know that using Python is kind of like cheating -- it's just too powerful;-). But seriously, this also means: if you can sort an array of strings lexicographically, as Python's sort intrinsically can, then just make the "key string" out of each number and sort that auxiliary array (you can then reconstruct the desired numbers array by a str->int transformation, or by doing the sort on the indices via indirection, etc etc); this is known as DSU (Decorate, Sort, Undecorate) and it's what the key= argument to Python's sort implements.

In more detail (pseudocode):

  1. allocate an array of char** aux as long as the numbers array
  2. for i from 0 to length of numbers-1, aux[i]=stringify(numbers[i])
  3. allocate an array of int indices of the same length
  4. for i from 0 to length of numbers-1, indices[i]=i
  5. sort indices, using as cmp(i,j) strcmp(aux[i],aux[j])
  6. allocate an array of int results of the same length
  7. for i from 0 to length of numbers-1, results[i]=numbers[indices[i]]
  8. memcpy results over numbers
  9. free every aux[i], and also aux, indices, results
Alex Martelli
That's cool. However, I am in search of an algorithm, than a way to do it in a particular language. :)
Skylark
...and aren't the steps 1 to 9 I list under "in more detail" enough of "an algorithm" for you...?
Alex Martelli
When I first left the comment, I suppose only the first two lines were there. :) You added the pseudocode algorithm later, didn't you? :) Now, those steps are helpful. Thanks!
Skylark
+2  A: 

My temptation would be to say that the int to string conversion would happen in the comparitor code rather than in bulk. Although this may be more elegant from a code-perspective I'd have to say that the execution effort would be greater as each number may be compared several times.

I'd be inclined to create a new array containing both the int and string representation (not sure that you need to pad the string versions for the string comparison to produce the order you've given), sort that on the string and then copy the int values back to the original array.

I can't think of a smart mathematically way of sorting this as by your own statement you want to sort lexicographically so you need to transform the numbers to strings to do that.

Lazarus
+3  A: 

I'd just turn them into strings, and then sort then sort using strcmp, which does lex comparisons.

Alternatively you can write a "lexcmp" function that compares two numbers using % 10 and /10 but that's basically the same thing as calling atoi many times, so not a good idea.

Brian Postow
Do you mean convert the whole array into an array of strings, or just do the conversion when the comparison is made?
A. Levy
you can do either, but I convert the whole array so you only do it once. Otherwise, you have to convert each number many (log n) times which is expensive...
Brian Postow
Converting log n times is not expensive if you are reading your data from CPU cache or registers (if you are on a register rich architecture). Perhaps you are right, but I've run into situtations where doing more work from data in the cache beats preprocessing the array.
A. Levy
A: 

If you're going for space-wise efficiency, I'd try just doing the work in the comparison function of the sort

int compare(int a, int b) {
   // convert a to string
   // convert b to string
   // return -1 if a < b, 0 if they are equal, 1 if a > b
}

If it's too slow (it's slower than preprocessing, for sure), keep track of the conversions somewhere so that the comparison function doesn't keep having to do them.

Lou Franco
+2  A: 

You definitely don't need to pad the result. It will not change the order of the lexicographical compare, it will be more error prone, and it will just waste CPU cycles. The most "space-wise" efficient method would be to convert the numbers to strings when they are compared. That way, you would not need to allocate an additional array, the numbers would be compared in place.

You can get a reasonably good implementation quickly by just converting them to strings as needed. Stringifying a number isn't particularly expensive and, since you are only dealing with two strings at a time, it is quite likely that they will remain in the CPU cache at all times. So the comparisons will be much faster than the case where you convert the entire array to strings since they will not need to be loaded from main memory into the cache. People tend to forget that a CPU has a cache and that algorithms which do a lot of their work in a small local area of memory will benefit greatly from the much faster cache access. On some architectures, the cache is so much faster than the memory that you can do hundreds of operations on your data in the time it would have taken you to load it from main memory. So doing more work in the comparison function could actually be significantly faster than pre-processing the array. Especially if you have a large array.

Try doing the string serialization and comparison in a comparator function and benchmark that. I think it will be a pretty good solution. Example java-ish pseudo-code:

public static int compare(Number numA, Number numB) {
    return numA.toString().compare(numB.toString());
}

I think that any fancy bit wise comparisons you could do would have to be approximately equivalent to the work involved in converting the numbers to strings. So you probably wouldn't get significant benefit. You can't just do a direct bit for bit comparison, that would give you a different order than lexicographical sort. You'll need to be able to figure out each digit for the number anyway, so it is most straightforward to just make them strings. There may be some slick trick, but every avenue I can think of off the top of my head is tricky, error-prone, and much more work than it is worth.

A. Levy
As an afterthought. This may also depend greatly on the language you are using. In C this is probably true. In more dynamic languages, there may be enough overhead in calling the comparison function to overwhelm the cache benefit.
A. Levy
+1  A: 

The actual sorting can be done by any algorithm you like. The key to this problem is finding the comparison function that will properly identify which numbers should be "less than" others, according to this scheme:

bool isLessThan(int a, int b)
{
    string aString = ToString(a);
    string bString = ToString(b);

    int charCount = min(aString.length(), bString.length())
    for (charIndex = 0; charIndex < charCount; charIndex++)
    {
        if (aString[charIndex] < bString[charIndex]) { return TRUE; }
    }

    // if the numbers are of different lengths, but identical
    // for the common digits (e.g. 123 and 12345)
    // the shorter string is considered "less"
    return (aString.length() < bString.length());
}
e.James
That was a neat comparison, thanks. This, in combination with the batch conversion to strings, then sorting, should probably be a best solution if nothing else works out.
Skylark
Ah, ok.. I'll decide batch conversion or not after all the answers are in.
Skylark
Batch string conversion should improve things substantially. I don't know of a sorting function that performs better than O(n), which means that the conversion to a string will have to happen for each node more than once. My comparison function would even do each one twice! Given the number of divisions required to convert an integer to a string, I'd be surprised if the string conversion wasn't your bottleneck.
e.James
A: 

Possible optimization: Instead of this:

I converted each integer to its string format, then added zeros to its right to make all the integers contain the same number of digits

you can multiply each number by (10^N - log10(number)), N being a number larger than log10 of any of your numbers.

Igor Oks
Although this would have 1 comparing equal to 100, for example (as does the original).
Dave Hinton
A: 
#!/usr/bin/perl

use strict;
use warnings;

my @x = ( 12, 2434, 23, 1, 654, 222, 56, 100000 );

print $_, "\n" for sort @x;

__END__

Some timings ... First, with empty @x:

C:\Temp> timethis s-empty
TimeThis :  Elapsed Time :  00:00:00.188

Now, with 10,000 randomly generated elements:

TimeThis :  Elapsed Time :  00:00:00.219

This includes the time taken to generate the 10,000 elements but not the time to output them to the console. The output adds about a second.

So, save some programmer time ;-)

Sinan Ünür
Hey, cool.. thanks for the numbers.
Skylark
+2  A: 

Since you mentioned Java is the actual language in question:

You don't need to convert to and from strings. Instead, define your own comparator and use that in the sort.

Specifically:

Comparator<Integer> lexCompare = new Comparator<Integer>(){
   int compareTo( Integer x, Integer y ) {
      return x.toString().compareTo( y.toString() );
   }
};

Then you can sort the array like this:

int[] array = /* whatever */;
Arrays.sort( array, lexCompare );

(Note: The int/Integer mismatch works automatically through auto-boxing)

Jason Cohen
A: 

Pseudocode:

sub sort_numbers_lexicographically (array) {
    for 0 <= i < array.length:
        array[i] = munge(array[i]);
    sort(array);  // using usual numeric comparisons
    for 0 <= i < array.length:
        array[i] = unmunge(array[i]);
}

So, what are munge and unmunge?

munge is different depending on the integer size. For example:

sub munge (4-bit-unsigned-integer n) {
    switch (n):
        case 0:  return 0
        case 1:  return 1
        case 2:  return 8
        case 3:  return 9
        case 4:  return 10
        case 5:  return 11
        case 6:  return 12
        case 7:  return 13
        case 8:  return 14
        case 9:  return 15
        case 10:  return 2
        case 11:  return 3
        case 12:  return 4
        case 13:  return 5
        case 14:  return 6
        case 15:  return 7
}

Esentially what munge is doing is saying what order 4 bit integers come in when sorted lexigraphically. I'm sure you can see that there is a pattern here --- I didn't have to use a switch --- and that you can write a version of munge that handles 32 bit integers reasonably easily. Think about how you would write versions of munge for 5, 6, and 7 bit integers if you can't immediately see the pattern.

unmunge is the inverse of munge.

So you can avoid converting anything to a string --- you don't need any extra memory.

Dave Hinton
A: 

If you want to try a better preprocess-sort-postprocess, then note that an int is at most 10 decimal digits (ignoring signed-ness for the time being).

So the binary-coded-decimal data for it fits in 64 bits. Map digit 0->1, 1->2 etc, and use 0 as a NUL terminator (to ensure that "1" comes out less than "10"). Shift each digit in turn, starting with the smallest, into the top of a long. Sort the longs, which will come out in lexicographical order for the original ints. Then convert back by shifting digits one at a time back out of the top of each long:

uint64_t munge(uint32_t i) {
    uint64_t acc = 0;
    while (i > 0) {
        acc = acc >> 4;
        uint64_t digit = (i % 10) + 1;
        acc += (digit << 60);
        i /= 10;
    }
    return acc;
}

uint32_t demunge(uint64_t l) {
    uint32_t acc = 0;
    while (l > 0) {
        acc *= 10;
        uint32_t digit = (l >> 60) - 1;
        acc += digit;
        l << 4;
    }
}

Or something like that. Since Java doesn't have unsigned ints, you'd have to modify it a little. It uses a lot of working memory (twice the size of the input), but that's still less than your initial approach. It might be faster than converting to strings on the fly in the comparator, but it uses more peak memory. Depending on the GC, it might churn its way through less memory total, though, and require less collection.

Steve Jessop
A: 

One really hacky method (using C) would be:

  • generate a new array of all the values converted to floats
  • do a sort using the mantissa (significand) bits for the comparison

In Java (from here):

long bits = Double.doubleToLongBits(5894.349580349);

boolean negative = (bits & 0x8000000000000000L) != 0; 
long exponent = bits & 0x7ff0000000000000L >> 52;
long mantissa = bits & 0x000fffffffffffffL;

so you would sort on the long mantissa here.

Phil H
That sorts them lexicographically in base 2 (assuming no loss of precision). Questioner wants them sorted lexicographically in base 10. So what you said, but using BigDecimal, could be a winner. Probably not (much) faster than String, though.
Steve Jessop