views:

203

answers:

3

I have not been able to write a regular expression for String.split (Java) that splits only on commas that are not inside parentheses.

Example:

(54654,4565):(45651,65423),4565:45651,(4565,4564):45651

Should yield the 3 strings:

  1. (54654,4565):(45651,65423)
  2. 4565:45651
  3. (4565,4564):45651

Any help much appreciated.

A: 

This works:

String regex = "((?<!\\d),)|(,(?!\\d))";

but it presumes that there is something other than a digit on at least one side of every comma you want to split on. So it's not really checking whether you're inside parens; it's only refusing to split on a comma that has digits on both sides.

As a result, if you're looking at this text:

"45651:65423,4565:45651"

then this solution fails: the comma has digits on both sides, so it is never split on. If you're more specific about what kinds of inputs you're expecting, we may be able to tailor our answers to your situation.
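To make the limitation concrete, here is a small sketch that runs the regex above against the question's example, the failing digits-only input, and one parenthesized pair of words taken from the asker's sample input. The class name `DigitBoundarySplit` and the `split` helper are made up for illustration only.

```java
import java.util.Arrays;

public class DigitBoundarySplit {
    // Regex from the answer above: a comma that has a non-digit
    // (or nothing) on at least one side.
    static final String REGEX = "((?<!\\d),)|(,(?!\\d))";

    // Hypothetical helper, just wraps String.split.
    static String[] split(String input) {
        return input.split(REGEX);
    }

    public static void main(String[] args) {
        // The question's example: splits into the three expected parts.
        System.out.println(Arrays.toString(
            split("(54654,4565):(45651,65423),4565:45651,(4565,4564):45651")));
        // prints [(54654,4565):(45651,65423), 4565:45651, (4565,4564):45651]

        // Digits on both sides of the comma: no split happens at all.
        System.out.println(Arrays.toString(split("45651:65423,4565:45651")));
        // prints [45651:65423,4565:45651]

        // Words inside parentheses: the comma is split on even though
        // it is inside the parens.
        System.out.println(Arrays.toString(split("4568:(temperature,dewpoint)")));
        // prints [4568:(temperature, dewpoint)]
    }
}
```

The last case shows the other side of the same assumption: the regex never looks at the parentheses, so a comma between two words inside parens is split as well.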

The expression language I have revolves around \\d:\\w pairs and looks like this (single escaped):

(\d|\((\d(,\d)*\)):\d|\((\w(,\w)*\))(,(\d|\((\d(,\d)*\)):\d|\((\w(,\w)*\)))*)

Example input:

4565:dewpoint,4568:(temperature,dewpoint),(4565,4568):temperature,(4565,4568):(temperature,dewpoint)

A: 

Just a reminder that you need to be careful if there will be any nesting. Regex just isn't very good at this. Consider the following snippet:

(a,)b,(c,(d,)e,)

Based on your question, you would only want to match the comma after b. The trick is that quantifiers are generally either greedy or un-greedy, with little middle ground.

A greedy expression would see the ( at the very beginning of the segment and the ) at the very end and take everything between them, even though there are closing parentheses elsewhere. Nothing would be matched.

An ungreedy expression would take only the smallest set possible, starting from the beginning. It would match the comma after b, but also see this segment as one unit: (c,(d,). Then it would proceed to also match the comma after e, because it's already taken the last (.

There are some engines that allow you to handle the nesting levels, but the expressions are usually ugly and hard to maintain: best to just avoid the feature unless you really understand it well.
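For comparison, if nesting ever does matter, a plain loop with a depth counter handles it without any regex. This is only a minimal sketch, not something from the question: the name `splitTopLevel` is made up, and it assumes the parentheses in the input are balanced.

```java
import java.util.ArrayList;
import java.util.List;

public class TopLevelSplit {
    // Hypothetical helper: splits on commas that sit at parenthesis depth 0.
    // Assumes balanced parentheses; unbalanced input is not validated.
    static List<String> splitTopLevel(String input) {
        List<String> parts = new ArrayList<>();
        int depth = 0;   // how many '(' are currently open
        int start = 0;   // start index of the current part
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == ',' && depth == 0) {
                // Top-level comma: close off the current part.
                parts.add(input.substring(start, i));
                start = i + 1;
            }
        }
        // Whatever follows the last top-level comma is the final part.
        parts.add(input.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitTopLevel("(a,)b,(c,(d,)e,)"));
        // prints [(a,)b, (c,(d,)e,)]
    }
}
```

On the snippet above it splits only at the comma after b, whatever the nesting depth.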

Joel Coehoorn
Thankfully nesting is not really a requirement for me, I have only one level of parentheses at any given time. I've written a simple parser (recursive descent through production rules) but this problem strikes me as being solvable more elegantly with the right regular expression.
+2  A: 

You can do this with just a lookahead, which is easier to work with than lookbehind.

String[] parts = str.split(",(?![^()]*+\\))");

But the other responders are right: if you couldn't come up with this regex on your own, what will you do when the requirements change? You're probably better off with a long-winded solution that you actually understand.
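Here is the lookahead split as a small runnable sketch, tried on the question's example and on the sample input from the asker's comment. The lookahead rejects any comma that is followed by a run of non-parenthesis characters and then a closing paren, i.e. a comma that is still inside an open group. The class name `LookaheadSplit` and the `split` wrapper are only for illustration.

```java
import java.util.Arrays;

public class LookaheadSplit {
    // Split on a comma unless the next parenthesis after it is a ')'.
    // [^()]*+ is possessive, so the engine does not backtrack into it.
    static String[] split(String str) {
        return str.split(",(?![^()]*+\\))");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            split("(54654,4565):(45651,65423),4565:45651,(4565,4564):45651")));
        // prints [(54654,4565):(45651,65423), 4565:45651, (4565,4564):45651]

        System.out.println(Arrays.toString(split(
            "4565:dewpoint,4568:(temperature,dewpoint),"
            + "(4565,4568):temperature,(4565,4568):(temperature,dewpoint)")));
        // prints [4565:dewpoint, 4568:(temperature,dewpoint), (4565,4568):temperature, (4565,4568):(temperature,dewpoint)]

        // No parentheses at all: the comma is split on as expected.
        System.out.println(Arrays.toString(split("45651:65423,4565:45651")));
        // prints [45651:65423, 4565:45651]
    }
}
```

Note that this relies on there being only one level of parentheses. With nested input such as (a,)b,(c,(d,)e,) it also splits at the comma after c, because the next parenthesis after that comma is an opening one.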

Alan Moore
Thanks, Alan, it does indeed work! I asked this question here because of my limited experience with regex but a desire to learn. When my requirements change I will have a new tool at my disposal, as I had not been exposed to lookbehind or lookahead before (apart from in full-fledged parser generators).
That's cool. Regexes are like a dirty little secret of programming. Anyone who goes around providing regex-based solutions gets in the habit of mentioning their limitations and pitfalls, because if we don't, someone else will. ;-)
Alan Moore