views:

273

answers:

4

I'm currently looking at a simple programming problem that might be fun to optimize - at least for anybody who believes that programming is art :) So here is it:

How to best represent long's as Strings while keeping their natural order?

Additionally, the String representation should match ^[A-Za-z0-9]+$. (I'm not too strict here, but avoid using control characters or anything that might cause headaches with encodings, is illegal in XML, has line breaks, or similar characters that will certainly cause problems)

Here's a JUnit test case:

@Test
public void longConversion() {
    final long[] longs = { Long.MIN_VALUE, Long.MAX_VALUE, -5664572164553633853L,
            -8089688774612278460L, 7275969614015446693L, 6698053890185294393L,
            734107703014507538L, -350843201400906614L, -4760869192643699168L,
            -2113787362183747885L, -5933876587372268970L, -7214749093842310327L, };

    // keep it reproducible
    //Collections.shuffle(Arrays.asList(longs));

    final String[] strings = new String[longs.length];
    for (int i = 0; i < longs.length; i++) {
        strings[i] = Converter.convertLong(longs[i]);
    }

    // Note: Comparator is not an option
    Arrays.sort(longs);
    Arrays.sort(strings);

    final Pattern allowed = Pattern.compile("^[A-Za-z0-9]+$");
    for (int i = 0; i < longs.length; i++) {
        assertTrue("string: " + strings[i], allowed.matcher(strings[i]).matches());
        assertEquals("string: " + strings[i], longs[i], Converter.parseLong(strings[i]));
    }
}

and here are the methods I'm looking for

public static class Converter {
    public static String convertLong(final long value) {
        // TODO
    }

    public static long parseLong(final String value) {
        // TODO
    }
}

I already have some ideas on how to approach this problem. Still, I though I might get some nice (creative) suggestions from the community.

Additionally, it would be nice if this conversion would be

  • as short as possible
  • easy to implement in other languages

EDIT: I'm quite glad to see that two very reputable programmers ran into the same problem as I did: using '-' for negative numbers can't work as the '-' doesn't reverse the order of sorting:

  1. -0001
  2. -0002
  3. 0000
  4. 0001
  5. 0002
+2  A: 

EDIT: Okay, so just adding the negative sign for negative numbers doesn't work... but you could convert the value into an effectively "unsigned" long such that Long.MIN_VALUE maps to "0000000000000000", and Long.MAX_VALUE maps to "FFFFFFFFFFFFFFFF". Harder to read, but will get the right results.

Basically you just need to add 2^63 to the value before turning it into hex - but that may be a slight pain to do in Java due to it not having unsigned longs... it may be easiest to do using BigInteger:

private static final BigInteger OFFSET = BigInteger.valueOf(Long.MIN_VALUE)
                                                   .negate();

public static String convertLong(long value) {
    BigInteger afterOffset = BigInteger.valueOf(value).add(OFFSET);
    return String.format("%016x", afterOffset);
}

public static long parseLong(String text) {
    BigInteger beforeOffset = new BigInteger(text, 16);
    return beforeOffset.subtract(OFFSET).longValue();
}

That wouldn't be terribly efficient, admittedly, but it works with all your test cases.

Jon Skeet
Doesn't work - see my EDIT (this was blasphemy, wasn't it?)
sfussenegger
@sfussenegger: Edited... have a look.
Jon Skeet
Yeah, that was the path I was following too. Sounds promising, I'm waiting for some code :) btw: I've already changed -1 to +1, although I wanted 0, but hit the wrong arrow, now SO won't let me change it ("Vote too old to be changed, unless this answer is edited") ... weird.
sfussenegger
@Jon: heh, you and I had the same idea. Except I've got code. :-)
cletus
@cletus: In what way do I not have code? :) (And yes, it was there before your comment, too :)
Jon Skeet
@Jon: see now if you had to process an array of 2 billion longs my code would be faster. :)
cletus
now it works, great. I'm going for cletus' solution though, as it is shorter. cheers
sfussenegger
@sfussenegger: I've pinched cletus's format string which makes mine even shorter :)
Jon Skeet
@Jon now it's getting kinda difficult do decide. In term of length both solutions are similar. At the same time, cletus's should be faster but yours is tolerant on shorter input (i.e. non 16 character strings) ... I think it's gonna be a sleepless night :)
sfussenegger
@sfussenegger: I'd argue that mine being tolerant of invalid input is actually a bad thing. For production code I'd probably make it reject that :) Oh for an unsigned long type, which would make life so much simpler.
Jon Skeet
@Jon But I'd bet you'd reject it with something more meaningful than an IndexOutOfBoundsException, wouldn't you? :)
sfussenegger
+11  A: 

Ok, take two:

class Converter {
  public static String convertLong(final long value) {
    return String.format("%016x", value - Long.MIN_VALUE);
  }

  public static long parseLong(final String value) {
    String first = value.substring(0, 8);
    String second = value.substring(8);
    long temp = (Long.parseLong(first, 16) << 32) | Long.parseLong(second, 16);
    return temp + Long.MIN_VALUE;
  }
}

This one takes a little explanation. Firstly, let me demonstrate that it is reversible and the resultant conversions should demonstrate the ordering:

for (long aLong : longs) {
  String out = Converter.convertLong(aLong);
  System.out.printf("%20d %16s %20d\n", aLong, out, Converter.parseLong(out));
}

Output:

-9223372036854775808 0000000000000000 -9223372036854775808
 9223372036854775807 ffffffffffffffff  9223372036854775807
-5664572164553633853 316365a0e7370fc3 -5664572164553633853
-8089688774612278460 0fbba6eba5c52344 -8089688774612278460
 7275969614015446693 e4f96fd06fed3ea5  7275969614015446693
 6698053890185294393 dcf444867aeaf239  6698053890185294393
  734107703014507538 8a301311010ec412   734107703014507538
 -350843201400906614 7b218df798a35c8a  -350843201400906614
-4760869192643699168 3dedfeb1865f1e20 -4760869192643699168
-2113787362183747885 62aa5197ea53e6d3 -2113787362183747885
-5933876587372268970 2da6a2aeccab3256 -5933876587372268970
-7214749093842310327 1be00fecadf52b49 -7214749093842310327

As you can see Long.MIN_VALUE and Long.MAX_VALUE (the first two rows) are correct and the other values basically fall in line.

What is this doing?

Assuming signed byte values you have:

  • -128 => 0x80
  • -1 => 0xFF
  • 0 => 0x00
  • 1 => 0x01
  • 127 => 0x7F

Now if you add 0x80 to those values you get:

  • -128 => 0x00
  • -1 => 0x7F
  • 0 => 0x80
  • 1 => 0x81
  • 127 => 0xFF

which is the correct order (with overflow).

Basically the above is doing that with 64 bit signed longs instead of 8 bit signed bytes.

The conversion back is a little more roundabout. You might think you can use:

return Long.parseLong(value, 16);

but you can't. Pass in 16 f's to that function (-1) and it will throw an exception. It seems to be treating that as an unsigned hex value, which long cannot accommodate. So instead I split it in half and parse each piece, combining them together, left-shifting the first half by 32 bits.

cletus
I would use `String.format`... Ohh, and missing the thread (un)safe disclaimer
Carlos Heuberger
@Int: You could use `String.format` but you need something to parse it to convert the other way anyway so you may as well do it with one method.
cletus
Easy, eh? Doesn't work - see my EDIT ;)
sfussenegger
Funny enough people keep voting for this answer although it's already proved wrong.
sfussenegger
@sfuss: it's corrected now. Just a tip for your future endeavours however: downvoting people giving you good faith answers will usually mean they simply won't help you. If I'd been at the rep cap or simply wasn't interested, normally I'd just delete my answer and move on in the face of what it is basically ingratitude.
cletus
I just followed the exact same path to come up with a solution. It's nearly the same, the unit test still fails for your solution though: "string: 0000000000000000 expected:<-9223372036854775808> but was:<9223372036854775807>". Simply change to "value - Long.MIN_VALUE" and "temp - Long.MIN_VALUE" respectively and we're there.
sfussenegger
@cletus Your answer was wrong, hence the down vote. Nevertheless, after you've fixed it, I changed it to an upvote and accepted your answer. Isn't that just how it is supposed to work?
sfussenegger
@cletus: I've now pinched your format string for my BigInteger answer. Gets rid of that horrible padding :)
Jon Skeet
@cletus ... and I've got a down vote for this question without getting any kind of explanation at all. Sometimes I wonder what SO would be like if you'd see who voted ... eye for an eye *muahahaha* Anywah, thanks for your effort! cheers
sfussenegger
@sfuss: perhaps it's just me and I really don't mean it like I'm annoyed. It's merely advice on getting good responses. Personally I don't downvote things if they have a typo or someone's put in a good faith answer and just happen to be wrong. A downvote is a bit like a slap on the wrist. Slaps on the wrist don't tend to engender further help when you're asking someone for a favour however. But maybe that's just me.
cletus
It wasn't a typo, it was an error in reasoning (additionally I accepted and edited your answer although the unit test still failed). So I though I might encourage you to revisit the problem by down voting ... and looking at the final solution I think I was right ;) Additionally, it doesn't start with "This is a pretty easy problem to solve." anymore which was a slight "slap on the wrist" too ;)
sfussenegger
A: 

If you don't need a printable String, you can encode the long in four chars after you've shifted the value by Long.MIN_VALUE (-0x80000000) to emulate an unsigned long:

public static String convertLong(long value) {
    value += Long.MIN_VALUE;
    return "" + 
        (char)(value>>48) + (char)(value>>32) + 
        (char)(value>>16) + (char)value; 
}

public static long parseLong(String value) {
    return (
        (((long)value.charAt(0))<<48) + 
        (((long)value.charAt(1))<<32) + 
        (((long)value.charAt(2))<<16) + 
        (long)value.charAt(3)) + Long.MIN_VALUE;
}

Usage of surrogate pairs is not a problem, since the natural order of a string is defined by the UTF-16 values in its chars and not by the UCS-2 codepoint values.

jarnbjo
Sorry, but while this does work, it's using non-printable characters which isn't an option. I've edited my question to include this requirement.
sfussenegger
So before I spend time thinking about another solution, would you mind sharing your definition of 'printable' with us? Are that all characters with a valid Unicode code point, ISO-8859-1 or perhaps just ASCII? Is a space character printable or not?
jarnbjo
I'd love to see something that doesn't cause headaches with possible encoding problems, fits in one line and plays nicely with representations in XML or CSV. [A-Za-Z0-9] should be a sane choice, but add some more if you need them. I'm not too strict on this requirement. For instance, using 0x00 as char will certainly cause headaches and must be avoided.
sfussenegger
A: 

There's a technique in RFC2550 -- an April 1st joke RFC about the Y10K problem with 4-digit dates -- that could be applied to this purpose. Essentially, each time the integer's string representation grows to require another digit, another letter or other (printable) character is prepended to retain desired sort-order. The negative rules are more arcane, yielding strings that are harder to read at a glance... but still easy enough to apply in code.

Nicely, for positive numbers, they're still readable.

See:

http://www.faqs.org/rfcs/rfc2550.html

gojomo