Using Java (1.6) I want to split an input string that consists of a header followed by a number of tokens. Each token has this format: a ! character, a space, a two-character token name (from a constrained list, e.g. C0 or 04) and then 5 digits. I have built a pattern for this, but it fails for one token (CE) unless I remove the requirement for the 5 digits after the token name. The unit test below explains this better than I could.

Can anyone help with what's going on with my failing pattern? The input CE token looks OK to me...

Cheers!

@Test
public void testInputSplitAnomaly() {
    Pattern pattern = Pattern.compile("(?=(! [04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE]\\d{5}))");
    splitByRegExp(pattern);
}
@Test
public void testInputSplitWorks() {
    Pattern pattern = Pattern.compile("(?=(! [04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE]))");
    splitByRegExp(pattern);
}


public void splitByRegExp(Pattern pattern) {
    String input = "& 0000800429! C600080 123456789-! C000026 213  00300! 0400020 A1Y1! Q200002 13! CE00202 01 ! Q600006 020507! C400012 O00511011";
    String[] tokens = pattern.split(input);
    Arrays.sort(tokens);
    System.out.println("-----------------------------");
    for (String token : tokens) {
        System.out.println(token.substring(0,11));
    }
    assertThat(tokens,Matchers.hasItemInArray(startsWith("! CE")));
    assertThat(tokens.length,is(8));
}
+1  A: 

This doesn't make any sense:

[04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE]

I believe you want:

(?:04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE)

Square brackets are only used for character classes, not general grouping. Use (?:...) or (...) for general grouping (the latter also captures).
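To make the difference concrete, here is a small sketch. The class name GroupingDemo, its helper methods, and the shortened two-name token list are made up for illustration; only the java.util.regex calls are real API. It counts capturing groups for each grouping form and checks a CE token against both the grouped and the bracketed pattern.

```java
import java.util.regex.Pattern;

class GroupingDemo {
    // Hypothetical helper: number of capturing groups the regex defines.
    public static int captures(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Hypothetical helper: true if the regex matches the entire input.
    public static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // (?:...) groups without capturing, (...) groups and captures.
        System.out.println(captures("(?:04|CE)\\d{5}"));
        System.out.println(captures("(04|CE)\\d{5}"));
        // A group treats CE as a two-character alternative.
        System.out.println(matchesAll("(?:04|CE)\\d{5}", "CE00202"));
        // Square brackets match only ONE character from the set.
        System.out.println(matchesAll("[04|CE]\\d{5}", "CE00202"));
    }
}
```

With square brackets, only the C is consumed by the class, so the E is then tested against \d and the match fails.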

Laurence Gonsalves
+1  A: 

I think that your mistake here is your use of square brackets. Don't forget that these indicate a character class, so [04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE] doesn't do what you expect it to.

What it does do is the following:

  • [04|C0|Q2|Q6|C4|B[2-6] constitutes a character class, matching one of: |, [, 0, 2, 3, 4, 5, 6, B, C or Q,
  • the rest is interpreted as a list of alternatives, specifically the character class mentioned above, or Q[8-9], or C6, or CE]. That is why the CE doesn't work: the input does not have a closing square bracket after it.

What you are probably after is (?:04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE)
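
As a sketch of how the corrected pattern could be dropped into the split from the question (the class name TokenSplitDemo and the method splitTokens are hypothetical; the input string is copied from the question, and the capturing group around the lookahead body is omitted because split does not use it):

```java
import java.util.regex.Pattern;

class TokenSplitDemo {
    // Same sample input as in the question.
    public static final String INPUT =
        "& 0000800429! C600080 123456789-! C000026 213  00300! 0400020 A1Y1! Q200002 13! CE00202 01 ! Q600006 020507! C400012 O00511011";

    // Zero-width lookahead: split before each "! ", token name, 5 digits.
    // The token names are alternatives in a non-capturing group.
    public static final Pattern FIXED =
        Pattern.compile("(?=! (?:04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE)\\d{5})");

    // Hypothetical helper: header first, then one element per token.
    public static String[] splitTokens(String input) {
        return FIXED.split(input);
    }

    public static void main(String[] args) {
        for (String token : splitTokens(INPUT)) {
            System.out.println(token);
        }
    }
}
```

For the sample input this should give the header plus seven tokens, including the one that starts with "! CE".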

Tim
@ anyone who may care: [04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE] should be equivalent to [04CQ26B[2-6]+trailingJunkNotInTheCharacterClass (note that it collapses to a character *set* and ] is not protected at a random location internally)
pst
Oops, of course that is missing a |.
pst
A late thought: wouldn't the first `[` match the first `]`, so the character class would be `[04|C0|Q2|Q6|C4|B[2-6]` leaving the next chunk to be interpreted as an alternation of `Q[8-9]` or `C6` or `CE]`.
Tim
Now added to answer.
Tim
@Tim: In Java regexes, the square brackets do "nest" - for example, `[B[2-6]Q[8-9]]` is the same as `[B2-6Q8-9]`. This is not true of all regex flavors (it may be unique to Java for all I know).
Alan Moore
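
A quick way to check the nesting behaviour described in the last comment. This is a sketch only: the class name NestedClassDemo and its methods are invented for illustration, and it assumes the union semantics for nested classes documented for java.util.regex.Pattern.

```java
import java.util.regex.Pattern;

class NestedClassDemo {
    // True when the nested and the flat class agree on a one-character input.
    public static boolean sameResult(String ch) {
        return Pattern.matches("[B[2-6]Q[8-9]]", ch)
            == Pattern.matches("[B2-6Q8-9]", ch);
    }

    // True when the one-character input is matched by the bracket
    // expression from the question, used here as the whole pattern.
    public static boolean inQuestionClass(String ch) {
        return Pattern.matches("[04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE]", ch);
    }

    public static void main(String[] args) {
        for (String ch : new String[] {"B", "3", "7", "Q", "9"}) {
            System.out.println(ch + " same result: " + sameResult(ch));
        }
        // With nesting, the whole expression is a single character class,
        // so "E" and "8" are members even though they appear late in it.
        System.out.println("E: " + inQuestionClass("E"));
        System.out.println("8: " + inQuestionClass("8"));
        System.out.println("7: " + inQuestionClass("7"));
    }
}
```

If the expression is one big class, then in the question's pattern "! C" followed by "E" can never satisfy \d{5}, which would explain why only the CE token fails.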