views:

221

answers:

5

Hi,

I need some help building a regular expression, and I think it'll be easier to explain with an example. I need an expression that matches a comma, but only if it's not inside this structure: "( )", like this:

,a,b,c,d,"("x","y",z)",e,f,g,

The first five and the last four commas should match the expression; the two between x, y and z inside the ( ) section shouldn't.

I tried a lot of combinations, but regular expressions are still a little foggy for me.

I want to use it with the split method in Java. The example is short, but the real input can be much longer and have more than one section between "( and )". The split method receives an expression, and any text that matches it (in this case the comma) is treated as the separator.

So, I want to do something like this:

String keys[] = row.split(expr);
System.out.println(keys[0]); // print a
System.out.println(keys[1]); // print b
System.out.println(keys[2]); // print c
System.out.println(keys[3]); // print d
System.out.println(keys[4]); // print "("x","y",z)"
System.out.println(keys[5]); // print e
System.out.println(keys[6]); // print f
System.out.println(keys[7]); // print g

Thanks!

+2  A: 

If the parens are paired correctly and cannot be nested, you can split the text first at parens, then process the chunks.

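// Requires java.util.ArrayList and java.util.List; `text` is the input row.
List<String> result = new ArrayList<String>();
String[] chunks = text.split("[()]");
for (int i = 0; i < chunks.length; i++) {
  if ((i % 2) == 0) {
    // Even-indexed chunks lie outside the parens: split them at commas.
    for (String atom : chunks[i].split(",")) {
      result.add(atom);
    }
  } else {
    // Odd-indexed chunks were inside a paren pair: keep them whole.
    result.add(chunks[i]);
  }
}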
Adam Schmideg
Yes, the software is actually working already; I'm just looking for something more concise, which is why I want a regex. An upvote for your time, though.
Alaor
+1  A: 

Well,

After some tests I found an answer that does what I need for now. At the moment, all items inside the "( ... )" block are themselves inside "", as in "("a", "b", "c")", so the regex ((?<!\"),)|(,(?!\")) works great for what I want!

But I'm still looking for one that can find the commas even if the inner terms have no "" around them.
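As a rough sketch of how that expression behaves with split, here is a small test. The class name, the method name, and the sample row are made up for illustration; only the pattern comes from the text above.

```java
import java.util.Arrays;

final class QuoteAwareSplit {
    // Pattern from above: a comma not preceded by a quote,
    // or a comma not followed by a quote.
    static final String EXPR = "((?<!\"),)|(,(?!\"))";

    static String[] splitRow(String row) {
        return row.split(EXPR);
    }

    public static void main(String[] args) {
        // Hypothetical sample row: every item inside "( ... )" is quoted.
        String row = "a,b,\"(\"x\",\"y\",\"z\")\",c";
        System.out.println(Arrays.toString(splitRow(row)));
        // prints [a, b, "("x","y","z")", c]
    }
}
```

One caveat with this approach: any comma that has a quote on both sides is never treated as a separator, so two quoted fields sitting next to each other outside a "( ... )" block would not be split either.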

Thanks for the help, guys.

Alaor
+3  A: 

Try this one:

(?![^(]*\)),

It worked in my testing and matched all commas not inside parentheses.

Edit: Gopi pointed out the need to escape the backslash in a Java string literal:

(?![^(]*\\)),

Edit: Alan Moore pointed out some unnecessary complexity. Fixed.
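
For anyone who wants to try it, here is a minimal sketch that runs the escaped pattern through split on the row from the question. The class and method names are placeholders, not part of any library.

```java
import java.util.Arrays;

final class ParenSplitDemo {
    // Java string literal for the regex (?![^(]*\)),
    static final String EXPR = "(?![^(]*\\)),";

    static String[] splitRow(String row) {
        return row.split(EXPR);
    }

    public static void main(String[] args) {
        // The example row from the question, including its leading and trailing comma.
        String row = ",a,b,c,d,\"(\"x\",\"y\",z)\",e,f,g,";
        String[] keys = splitRow(row);
        System.out.println(keys.length);            // prints 9
        System.out.println(Arrays.toString(keys));
        // prints [, a, b, c, d, "("x","y",z)", e, f, g]
    }
}
```

Note that because the row starts with a comma, split returns an empty string as keys[0], so "a" ends up in keys[1]. The trailing empty string after the last comma is dropped by split.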

cnanney
This matches the commas, but I think he wants to match the values between the commas. Otherwise his split makes no sense.
evildead
+1. In Java you have to escape the backslashes, so it would be like `(?![^(]*\\)),(?!=.*\\()`
Gopi
The expression in your second lookahead, `=.*\(`, matches an equals sign, zero or more of anything, and a left parenthesis. Since there are no equals signs in the text, the negative lookahead always succeeds. It's the *first* lookahead that's doing all the work.
Alan Moore
Good eye, Alan - thanks for pointing that out!
cnanney
The second one was supposed to be a negative lookbehind, which I incorrectly had as `(?!=...)` instead of `(?<!...)`. But, as you pointed out, wasn't even necessary.
cnanney
+1  A: 

This should do what you want:

(".*")|([a-z])

I didn't check it in Java, but if you test it with http://www.fileformat.info/tool/regex.htm, the groups $1 and $2 contain the right values, so they match and you should get what you want. It gets a little trickier if the values between the commas are more complex than a-z.

If I understand split correctly, don't use it; just fill your array with the backreference $0, which holds the values you are looking for. A match function may be a better way, and working with the values directly lets you use this really simple regexp. The other solutions I've seen so far are very good, no question about that, but they are really complicated, and in two weeks you won't remember exactly what the regexp did. Inverting a problem often makes it simpler.
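
A sketch of that match-instead-of-split idea in Java could look like the following. The token pattern here is an assumption, not the expression above: it takes either a whole "( ... )" block or a run of non-comma characters, so it also copes with more than one block per row. Class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenScanner {
    // Assumed pattern: a quoted "( ... )" block, or else one or more non-comma characters.
    private static final Pattern TOKEN = Pattern.compile("\"\\([^)]*\\)\"|[^,]+");

    static List<String> tokens(String row) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(row);
        while (m.find()) {
            result.add(m.group()); // group() is the whole match, i.e. $0
        }
        return result;
    }

    public static void main(String[] args) {
        String row = ",a,b,c,d,\"(\"x\",\"y\",z)\",e,f,g,";
        System.out.println(tokens(row));
        // prints [a, b, c, d, "("x","y",z)", e, f, g]
    }
}
```

Because it collects values rather than separators, this version silently skips empty fields (for example the one before the leading comma), which may or may not be what you want.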

evildead
+6  A: 

You can do this with a negative lookahead. Here's a slightly simplified problem to illustrate the idea:

String text = "a;b;c;d;<x;y;z>;e;f;g;<p;q;r;s>;h;i;j";

String[] parts = text.split(";(?![^<>]*>)");

System.out.println(java.util.Arrays.toString(parts));
//  _  _  _  _  _______  _  _  _  _________  _  _  _
// [a, b, c, d, <x;y;z>, e, f, g, <p;q;r;s>, h, i, j]

Note that the delimiter is now ; instead of , and that the brackets are simply < and > instead of "( and )", but the idea still works.


On the pattern

The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.

The * repetition specifier matches the preceding pattern zero or more times.

The (?!…) is a negative lookahead; it can be used to assert that a certain pattern DOES NOT match, looking ahead (i.e. to the right) of the current position.

The pattern [^<>]*> matches a sequence (possibly empty) of anything except the brackets < and >, followed by a closing bracket >.

Putting all of the above together, we get ;(?![^<>]*>), which matches a ;, but only if the first bracket to its right is not a closing one, because seeing a closing bracket first would mean that the ; is "inside" a bracketed section.
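
To see which delimiters the lookahead lets through, here is a small sketch that prints the index of every ; the pattern matches. The class name, method name, and the short sample text are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadPositions {
    private static final Pattern DELIM = Pattern.compile(";(?![^<>]*>)");

    // Returns the index of every ';' that is not inside a <...> section.
    static List<Integer> delimiterIndexes(String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = DELIM.matcher(text);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // Indexes: a=0 ;=1 <=2 x=3 ;=4 y=5 >=6 ;=7 b=8
        System.out.println(delimiterIndexes("a;<x;y>;b"));
        // prints [1, 7]
    }
}
```

The ; at index 4 is rejected because the first bracket to its right is >, while the ones at indexes 1 and 7 see either < or no bracket at all.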

This technique, with some modifications, can be adapted to the original problem. Remember to escape regex metacharacters ( and ) as necessary, and of course " as well as \ in a Java string literal must be escaped by preceding with a \.

You can also make the * possessive to try to improve performance, i.e. ;(?![^<>]*+>).
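
As one possible adaptation to the original problem, the sketch below swaps < and > for ( and ) and keeps the possessive quantifier. It assumes parentheses only ever appear as part of the "( ... )" sections and are never nested; the class name, method name, and sample row are illustrative only.

```java
import java.util.Arrays;

final class PossessiveParenSplit {
    // Regex ,(?![^()]*+\)) written as a Java string literal:
    // a comma, unless the first parenthesis to its right is a closing one.
    static final String EXPR = ",(?![^()]*+\\))";

    static String[] splitRow(String row) {
        return row.split(EXPR);
    }

    public static void main(String[] args) {
        // Hypothetical row with two "( ... )" sections.
        String row = "a,b,\"(\"x\",\"y\",z)\",c,\"(\"p\",q)\",d";
        System.out.println(Arrays.toString(splitRow(row)));
        // prints [a, b, "("x","y",z)", c, "("p",q)", d]
    }
}
```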

polygenelubricants
Nice! That's it.
Alaor
@Alaor: it should be said that adapting this from `<…>` to `"(…)"` is not trivial if you can have `(…)` and `"…"`, etc. Also, the lookahead here is variable length, meaning performance isn't the best. Scanning rather than splitting is probably the best option.
polygenelubricants
I can't; there's only one form: "( or )". Actually, I can check just the parentheses, since no other terms can contain them. So I only need to change < ... > to ( ... ).
Alaor
I wouldn't expect performance to be an issue, but I would use a possessive quantifier anyway, just because I can: `;(?![^<]*+>)` ref: http://www.regular-expressions.info/possessive.html
Alan Moore
@Alan: good point on possessive. Unfortunately your pattern doesn't work, though that's my fault: I forgot to put the other parenthesis in the character class. I've fixed this in the answer, incorporating your possessive quantifier suggestion with the fixed pattern.
polygenelubricants