I have a String of the format "[(1, 2), (2, 3), (3, 4)]", with an arbitrary number of elements. I'm trying to split it on the commas separating the coordinates, that is, to retrieve (1, 2), (2, 3), and (3, 4).

Can I do it in Java regex? I'm a complete noob but hoping Java regex is powerful enough for it. If it isn't, could you suggest an alternative?

+5  A: 

You can use String#split() for this.

String string = "[(1, 2), (2, 3), (3, 4)]";
string = string.substring(1, string.length() - 1); // Get rid of the square brackets.
String[] parts = string.split("(?<=\\))(,\\s*)(?=\\()");
for (String part : parts) {
    part = part.substring(1, part.length() - 1); // Get rid of parentheses.
    String[] coords = part.split(",\\s*");
    int x = Integer.parseInt(coords[0]);
    int y = Integer.parseInt(coords[1]);
    System.out.printf("x=%d, y=%d\n", x, y);
}

The (?<=\\)) positive lookbehind means that the match must be preceded by ). The (?=\\() positive lookahead means that it must be followed by (. The (,\\s*) means that the string is split on the , and any whitespace after it. The \\ are there just to escape regex-specific characters.

That said, this particular String is recognizable as the output of List#toString(). Are you sure you're doing things the right way? ;)

Update: as per the comments, you can indeed also go the other way round and get rid of the non-digits:

String string = "[(1, 2), (2, 3), (3, 4)]";
String[] parts = string.split("\\D.");
for (int i = 1; i < parts.length; i += 3) {
    int x = Integer.parseInt(parts[i]);
    int y = Integer.parseInt(parts[i + 1]);
    System.out.printf("x=%d, y=%d\n", x, y);
}

Here the \\D means that the string is split on any non-digit (the \\d stands for digit). The . after it consumes one more character, so each two-character separator in this input (such as ", " or "),") is swallowed as a single delimiter. Blank entries still remain, one at the start and one between each pair, which is why the loop starts at index 1 and steps by 3. I must however admit that I'm not sure how to eliminate those blank matches. I'm not a trained regex guru yet. Hey, Bart K, can you do it better?

After all, it's ultimately better to use a parser for this. See Hubert's answer on this topic.
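One way to avoid the blank entries, as a sketch rather than a definitive fix: strip the leading non-digits first, then split on runs of non-digits with \\D+. The class and method names (DigitSplit, digitRuns) are made up for illustration, and the code assumes non-negative integers only.

```java
class DigitSplit {
    // Returns the runs of digits found in the input, in order, with no empty entries.
    static String[] digitRuns(String input) {
        // Remove leading non-digits so split() cannot produce a leading empty string.
        String trimmed = input.replaceFirst("^\\D+", "");
        // "".split(...) would return one empty string, so handle that case explicitly.
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\D+");
    }

    public static void main(String[] args) {
        String[] parts = digitRuns("[(1, 2), (2, 3), (3, 4)]");
        // Every two consecutive entries form one coordinate.
        for (int i = 0; i + 1 < parts.length; i += 2) {
            int x = Integer.parseInt(parts[i]);
            int y = Integer.parseInt(parts[i + 1]);
            System.out.printf("x=%d, y=%d%n", x, y);
        }
    }
}
```

Trailing empty strings are dropped by split() itself, so only the leading one needs special handling.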

BalusC
There are commas in the substrings as well... You can use `string.split("),");` and then put the `)` back.
Y. Shoham
Oops, didn't notice that .. Updated answer.
BalusC
Well spotted! I'm trying to reproduce a list of coordinates from, ahem, a List<Coordinate> effectively.
Beau Martínez
@Beau, and you have no reference to that List any more? It is a bit brittle to create it from the output of a `toString()` return...
Bart Kiers
@Bart If only! I'm retrieving Strings representing a series of moves from a game via a web service. Strong typing FTW!
Beau Martínez
:) (15 char fill)
Bart Kiers
@Beau, I now see what you need. I added a few more lines to get the coords out.
BalusC
Great stuff. This tempted me to mess around with regular expressions and I came up with \([0-9], [0-9]\) to NOT include anything that has the form of coordinates. It would be nice to get it working with a negative lookaround as explained in this link: http://stackoverflow.com/questions/406230/regular-expression-to-match-string-not-containing-a-word
James P.
Whau, didn't know you could do *that* with a regular expression! Guess I need to fetch "Mastering Regular Expressions" from the shelf and read up on this stuff :)
Jørn Schou-Rode
That being said, in the particular case of parsing coordinates, I would recommend the simpler/more comprehensible solution from my answer or the `Scanner` solution suggested by Hubert.
Jørn Schou-Rode
Yes, strings of that kind are indeed better parsed/tokenized after all.
BalusC
+1  A: 

Will there always be 3 groups of coordinates? You could try:

\[(\(\d,\s\d\)), (\(\d,\s\d\)), (\(\d,\s\d\))\]
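For an arbitrary number of groups, a fixed pattern like this cannot simply be given a quantifier and still capture every pair, because a repeated capture group in Java keeps only its last match. A possible sketch, under the assumption that values are non-negative integers: validate the whole string with a quantified pattern, then pull out each pair with Matcher#find(). The names CoordList, PAIR, WHOLE, ONE and pairs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CoordList {
    // One coordinate such as "(1, 2)".
    private static final String PAIR = "\\(\\d+,\\s*\\d+\\)";
    // The whole list: zero or more comma-separated pairs inside square brackets.
    private static final Pattern WHOLE =
            Pattern.compile("\\[(?:" + PAIR + "(?:,\\s*" + PAIR + ")*)?\\]");
    private static final Pattern ONE = Pattern.compile(PAIR);

    // Returns each "(x, y)" substring, or throws if the input is not a well-formed list.
    static List<String> pairs(String input) {
        if (!WHOLE.matcher(input).matches()) {
            throw new IllegalArgumentException("Not a coordinate list: " + input);
        }
        List<String> result = new ArrayList<String>();
        Matcher m = ONE.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```

Calling CoordList.pairs("[(1, 2), (2, 3), (3, 4)]") yields the three substrings (1, 2), (2, 3) and (3, 4).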

FrustratedWithFormsDesigner
Not necessarily! I'll edit the question; cheers on the quick reply. I'm assuming some ?*+ quantifiers will do the trick from there?
Beau Martínez
+3  A: 

If you do not require the expression to validate the syntax around the coordinates, this should do:

\(\d+,\s\d+\)

This expression will return several matches (three with the input from your example).

In your question, you state that you want to "retrieve (1, 2), (2, 3), and (3, 4)". If you actually need the pair of values associated with each coordinate, you can drop the parentheses and modify the regex to do some captures:

(\d+),\s(\d+)

The Java code will look something like this:

import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(\\d+),\\s(\\d+)");
        Matcher matcher = pattern.matcher("[(1, 2), (2, 3), (3, 4)]");

        while (matcher.find()) {
            int x = Integer.parseInt(matcher.group(1));
            int y = Integer.parseInt(matcher.group(2));
            System.out.printf("x=%d, y=%d\n", x, y);
        }
    }
}
Jørn Schou-Rode
All I get are the brackets! :/
Beau Martínez
I have added a Java code sample showing how to use the regex. Does this fail as well?
Jørn Schou-Rode
`Integer.parse(...)` does not work: it's `Integer.parseInt(...)`. I took the liberty to edit it and post a working example of your snippet.
Bart Kiers
The regex returns "), ("s and "("s and ")"s; I'm using String.split(), should I use Matcher and use groups instead?
Beau Martínez
Bart Kiers
@Beau: The regex I have posted will match the actual coordinates, so using it with `String.split()` will give you a lot of `), (` matches. The code sample in my answer should guide you on your way, though.
Jørn Schou-Rode
@Bart: Thanks for fixing my broken code :)
Jørn Schou-Rode
A: 

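With a regex, you can split on the comma preceded by (?<=\)), which uses a positive lookbehind:

String[] subs = str.replaceAll("\\[", "").replaceAll("\\]", "").split("(?<=\\)),");

With simple string functions, you can drop the [ and ], split on the literal "),", and then put the ) back. Note that in Java, String#split() takes a regex, so the ) has to be escaped: string.split("\\),").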
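The plain-string variant could look roughly like this. It is only a sketch: the names PlainSplit and split are made up, and it assumes the separator between coordinates is always exactly "), " with a single space. Pattern.quote() is used so that the ) is treated literally by String#split(), which otherwise expects a regex.

```java
import java.util.regex.Pattern;

class PlainSplit {
    // Splits "[(1, 2), (2, 3)]" into "(1, 2)" and "(2, 3)".
    static String[] split(String input) {
        // Drop the surrounding [ and ].
        String inner = input.substring(1, input.length() - 1);
        if (inner.isEmpty()) {
            return new String[0];
        }
        // Split on the literal text "), " (assumed separator).
        String[] parts = inner.split(Pattern.quote("), "));
        // Every part except the last one lost its closing parenthesis; put it back.
        for (int i = 0; i < parts.length - 1; i++) {
            parts[i] = parts[i] + ")";
        }
        return parts;
    }
}
```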

Y. Shoham
Your regex produces `(1`, `2), (2`, `3), (3` and `4)` on given example?
BalusC
Oops. I fixed from Negative to Positive. Now it should work.
Y. Shoham
The `"(?<=\\)),\\s*"` would be nicer as it covers spaces as well. In Java regex strings you by the way need to double-escape the \.
BalusC
Right again. :)
Y. Shoham
+1  A: 

If you use regex, you are going to get lousy error reporting and things will get exponentially more complicated if your requirements change (For instance, if you have to parse the sets in different square brackets into different groups).

I recommend you just write the parser by hand. It's like 10 lines of code and shouldn't be very brittle. Track everything you are doing: open parens, close parens, open brackets, close brackets and commas. It's like a switch statement with 5 options (and a default), really not that bad.

For a minimal approach, open parens and open brackets can be ignored, so there are really only 3 cases.

This would be the bare minimum.

// Java sketch: "input" is the String to parse, "output" is a hypothetical result collector
int valueA = 0;
String lastValue = "";
StringTokenizer tokens = new StringTokenizer(input, "[](),", true);

// StringTokenizer is not Iterable, so walk it with hasMoreTokens()/nextToken()
while (tokens.hasMoreTokens()) {
    String token = tokens.nextToken();

    // The token before the ) is the second int of the pair, and the first should
    // already be stored
    if (token.equals(")"))
        output.addResult(valueA, Integer.parseInt(lastValue.trim()));

    // The token before the comma is the first int of the pair
    else if (token.equals(","))
        valueA = Integer.parseInt(lastValue.trim());

    // Just store off this token and deal with it when we hit the proper delim
    else
        lastValue = token;
}

This is no better than a minimal regex based solution EXCEPT that it will be MUCH easier to maintain and enhance. (add error checking, add a stack for paren & square brace matching and checking for misplaced commas and other invalid syntax)

As an example of expandability, if you were to have to place different sets of square-bracket delimited groups into different output sets, then the addition is something as simple as:

    // When we close the square bracket, start a new output group.
    else if(token.equals("]"))
        output.startNewGroup();

And checking for parens is as easy as creating a stack of chars and pushing each [ or ( onto the stack, then when you get a ] or ), pop the stack and assert that it matches. Also, when you are done, make sure your stack.size() == 0.
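Putting the pieces described above together, a fuller version might look like the sketch below. It is one possible reading of the approach, not the only one: the names CoordParser, parse and expect are made up, pairs are returned as int[] arrays for lack of a Coordinate class, and the space is added to the delimiter set so that tokens need no trimming.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.StringTokenizer;

class CoordParser {
    // Parses "[(1, 2), (2, 3)]" into a list of {x, y} arrays, checking bracket balance.
    static List<int[]> parse(String input) {
        List<int[]> result = new ArrayList<int[]>();
        Deque<String> open = new ArrayDeque<String>(); // stack of unclosed "[" and "("
        StringTokenizer tokens = new StringTokenizer(input, "[](), ", true);
        int first = 0;
        String lastValue = null;

        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.equals(" ")) {
                continue; // whitespace carries no meaning here
            } else if (token.equals("[") || token.equals("(")) {
                open.push(token);
            } else if (token.equals(")")) {
                expect(open, "(");
                // parseInt(null) throws NumberFormatException if the second value is missing
                result.add(new int[] { first, Integer.parseInt(lastValue) });
                lastValue = null;
            } else if (token.equals("]")) {
                expect(open, "[");
            } else if (token.equals(",")) {
                // Only a comma inside parentheses separates the two ints of a pair;
                // a comma between pairs is ignored.
                if ("(".equals(open.peek())) {
                    first = Integer.parseInt(lastValue);
                    lastValue = null;
                }
            } else {
                lastValue = token; // remember the value until we hit its delimiter
            }
        }
        if (!open.isEmpty()) {
            throw new IllegalArgumentException("Unclosed " + open.peek());
        }
        return result;
    }

    // Pops the stack and checks that the popped opener matches the closer just seen.
    private static void expect(Deque<String> open, String wanted) {
        if (open.isEmpty() || !open.pop().equals(wanted)) {
            throw new IllegalArgumentException("Unbalanced input, expected opening " + wanted);
        }
    }
}
```

Malformed input such as "[(1, 2]" is rejected with an IllegalArgumentException instead of silently producing partial results, which is the kind of error reporting a bare regex does not give you.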

Bill K
...You might be on to something here... Any chance you could mock-up some code?
Beau Martínez
This sounds like the event-driven approach SAX uses to parse XML. I suppose you'll need to go through the text character by character and build up a series of algorithms to detect various patterns.
James P.
+7  A: 

If you can use Java 5:

    Scanner sc = new Scanner(coords);
    sc.useDelimiter("\\D+"); // skip everything that is not a digit; + rather than * so multi-digit values are not split apart
    List<Coord> result = new ArrayList<Coord>();
    while (sc.hasNextInt()) {
        result.add(new Coord(sc.nextInt(), sc.nextInt()));
    }
    return result;

EDIT: we don't know how many coordinates are passed in the string 'coords'
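If negative or multi-digit values can occur, the delimiter needs some care. A delimiter that can match the empty string (anything ending in *) lets Scanner break between any two characters, so "-12" may come out as "-", "1" and "2". The sketch below uses "[^-0-9]+" instead; the names ScannerCoords and parse are made up, pairs are returned as int[] arrays in place of Coord, and the input is assumed to contain an even number of integers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

class ScannerCoords {
    // Reads all integers from the input and groups them two by two.
    static List<int[]> parse(String coords) {
        Scanner sc = new Scanner(coords);
        sc.useLocale(Locale.US); // plain ASCII minus sign and digits
        // One or more characters that are neither a digit nor a minus sign.
        sc.useDelimiter("[^-0-9]+");
        List<int[]> result = new ArrayList<int[]>();
        while (sc.hasNextInt()) {
            int x = sc.nextInt();
            int y = sc.nextInt(); // throws if the second value of a pair is missing
            result.add(new int[] { x, y });
        }
        sc.close();
        return result;
    }
}
```

A stray minus sign that is not followed by digits would form a token that is not an int, so the loop would stop there rather than skip it.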

Hubert
Nice solution! And if you replace `Coord` with `java.awt.Point` it compiles as it is.
Fabian Steeg
Watch out for negative values!
notnoop
@notnoop: true, and as strange as it seems I couldn't succeed in using a delimiter pattern like `"[^-0-9]*"`, I had to use something less trivial like `"[^0-9]*[(),]\\s*"`. I'm on Sun JDK6.
Hubert
I love this! However, as I asked for the regex, I'll choose the best regex answer as the correct one for the sake of people with a similar question ;) 1 INTERNET FOR YOU
Beau Martínez