Hi,

In Java, given a text like foo <on> bar </on> thing <on> again</on> now, I want a regex with groups such that successive calls to find give me "foo", "bar", an empty string, then "thing", "again", "now".

If I use (.*?)<on>(.*?)</on>(?!<on>), I get only two groups per match (foo and bar, then thing and again), and I don't get the trailing "now".

If I use (.*?)<on>(.*?)</on>((?!<on>)), I get foo, bar and an empty string, then thing, again and an empty string (where I want "now").

What is the magic formula, please?

Thanks.

A: 

My recommendations

  • there is no need to match text before <on> and after </on>
  • use a non-greedy (reluctant) quantifier to match the text between <on> and the next </on>
  • use a loop with Matcher.find() to step through all occurrences, if possible. There is no need to do it all at once with one big fat regexp!
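
A minimal sketch of that loop, assuming the tags are never nested and that surrounding whitespace should be trimmed. The class name OnTagFinder and the method findOnContents are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OnTagFinder {
    // The reluctant (.*?) stops at the first </on> that follows each <on>.
    private static final Pattern ON_TAG = Pattern.compile("<on>(.*?)</on>");

    // Collects the trimmed text of every <on>...</on> pair, in order.
    static List<String> findOnContents(String text) {
        List<String> contents = new ArrayList<>();
        Matcher m = ON_TAG.matcher(text);
        while (m.find()) {              // one iteration per <on>...</on> pair
            contents.add(m.group(1).trim());
        }
        return contents;
    }

    public static void main(String[] args) {
        String text = "foo <on> bar </on> thing <on> again</on> now";
        System.out.println(findOnContents(text));
        // prints "[bar, again]"
    }
}
```

If the text between the tags can span several lines, compiling the pattern with Pattern.DOTALL would probably be needed, since . does not match line terminators by default.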
Ingo
OK, I do that. Thanks.
Istao
Fine. Your program will be more readable and maintainable that way.
Ingo
+2  A: 

If you insist on doing this with regex, then you can try to use \s*<[^>]*>\s* as delimiter:

    String text = "foo <on> bar </on> thing <on> again</on> now";
    String[] parts = text.split("\\s*<[^>]*>\\s*");
    System.out.println(java.util.Arrays.toString(parts));
    // prints "[foo, bar, thing, again, now]"

I'm not sure if this is exactly what you need, because the question is not entirely clear.


Perhaps something like this was required:

    String text = "1<on>2</on>3<X>4</X>5<X>6</X>7<on>8</on><X>9</X>10";
    String[] parts = text.split("\\s*</?on>\\s*|<[^>]*>[^>]*>");
    System.out.println(java.util.Arrays.toString(parts));
    // prints "[1, 2, 3, 5, 7, 8, , 10]"

This doesn't handle nested tags. If you have those, you'd really want to dump regex and use an actual HTML parser.

If you don't want the empty string in the middle of the array, then just use (?:delimiter)+ as the delimiter, so that a run of consecutive delimiters is treated as a single one.

    String text = "1<on>2</on>3<X>4</X>5<X>6</X>7<on>8</on><X>9</X>10";
    String[] parts = text.split("(?:\\s*</?on>\\s*|<[^>]*>[^>]*>)+");
    System.out.println(java.util.Arrays.toString(parts));
    // prints "[1, 2, 3, 5, 7, 8, 10]"
polygenelubricants
No, sorry, I want to catch <on> and only <on>, but not <in> for instance.
Istao
@Istao: Still not clear. Why do you need `foo` and `thing`, then? Edit question with A LOT MORE examples.
polygenelubricants