views:

79

answers:

2

This question is a combination of regex practice and unit testing practice.

Regex part

I authored this problem separateThousands for personal practice:

Given a number as a string, introduce commas to separate thousands. The number may contain an optional minus sign, and an optional decimal part. There will not be any superfluous leading zeroes.

Here's my solution:

String separateThousands(String s) {
  return s.replaceAll(
      String.format("(?:%s)|(?:%s)",
        "(?<=\\G\\d{3})(?=\\d)",
        "(?<=^-?\\d{1,3})(?=(?:\\d{3})+(?!\\d))"
      ),
      ","
  );
}

The way it works is that it classifies two types of commas, the first, and the rest. In the above regex, the rest subpattern actually appears before the first. A match will always be zero-length, which will be replaceAll with ",".

The rest basically looks behind to see if there was a match followed by 3 digits, and looks ahead to see if there's a digit. It's some sort of a chain reaction mechanism triggered by the previous match.

The first basically looks behind for ^ anchor, followed by an optional minus sign, and between 1 to 3 digits. The rest of the string from that point must match triplets of digits, followed by a nondigit (which could either be $ or \.).

My question for this part is:

  • Can this regex be simplified?
  • Can it be optimized further?
    • Ordering rest before first is deliberate, since first is only needed once
    • No capturing group

Unit testing part

As I've mentioned, I'm the author of this problem, so I'm also the one responsible for coming up with testcases for them. Here they are:

INPUT, OUTPUT
"1000", "1,000"
"-12345", "-12,345"
"-1234567890.1234567890", "-1,234,567,890.1234567890"
"123.456", "123.456"
".666666", ".666666"
"0", "0"
"123456789", "123,456,789"
"1234.5678", "1,234.5678"
"-55555.55555", "-55,555.55555"
"0.123456789", "0.123456789"
"123456.789", "123,456.789"

I haven't had much experience with industrial-strength unit testing, so I'm wondering if others can comment whether this is a good coverage, whether I've missed anything important, etc (I can always add more tests if there's a scenario I've missed).

+1  A: 

When you state the requirements are you intending for them to be enforced by your method?

The number may contain an optional minus sign, and an optional decimal part. There will not be any superfluous leading zeroes.

If your intent is to have the method detect when those constraints are violated you will need additional to write additional unit-tests to ensure that contract is being enforced.

What about testing for 1234.5678.91011?

Do you expect your method to return 1,234.5678.91011 or just ignore the whole thing? Best to write a test to verify your expectations

nvuono
These are good questions. I'm not really sure what I'm supposed to do with unit tests. Right now, if the input is "invalid" (by whose definition, I'm not sure), I just scream "undefined behavior!". I'm not sure if that's the proper way to unit test. Maybe I should just read a book.
polygenelubricants
We're really just relying on the underlying RegEx engine to find the matching text and apply the transformation where appropriate. I guess the most important testing would ensure you don't malform text by adding commas where they don't belong.
nvuono
+1  A: 

This works for me:

return s.replaceAll("(\\G-?\\d{1,3})(?=(?:\\d{3})++(?!\\d))", "$1,");

The first time through, \G acts the same as ^, and the lookahead forces \d{1,3} to consume only as many characters as necessary to leave the match position at a three-digit boundary. After that, \d{1,3} consumes the maximum three digits every time, with \G to keep it anchored to the end of the previous match.

As for your unit tests, I would just make it clear in the problem description that the input will always be valid number, with at most one decimal point.

Alan Moore
@Alan: any significance to the possessive quantifier?
polygenelubricants
@poly: Not really. It makes the lookahead infinitesimally more efficient, but the regex won't fail without it. I just like to use them whenever I know indeterminacy is not in my interest.
Alan Moore
@Alan: I haven't done any testing, but my gut tells me that my regex is faster since it doesn't use any capturing group, `\0` is empty, and the variable-length lookahead _first_ is needed less often (the more frequent _rest_ is simpler in comparison). Will do some testing and report tomorrow (need sleep).
polygenelubricants
@Alan: By the way, if it's not obvious from the recent developments, I no longer see regex as this magical flute to make beautiful (sometimes orgasmic!) music with. I now see it as just a tool, which is why I picked a more "real" example (separating thousands), and why I'm more concerned with performance etc rather than conciseness, elegance, and all that first love blindness stuff =)
polygenelubricants
@poly: If performance were a real concern, not using captures *could* make a significant difference--good point there. And that would apply even if I had left out the parens and used `$0` instead of `$1`, which I could have done. Most of the cost would come from parsing the group references and building the replacement string; I would expect that to dwarf any difference in *regex* performance.
Alan Moore
@poly: As for the attitude progression: good on you! But I hope you don't lose your enthusiasm completely; your solution to the `repeatEnd` problem (http://stackoverflow.com/questions/2606214/codingbat-repeatend-using-regex) was bloody brilliant.
Alan Moore