+1  Q: 

Java Regex Problem

Hello,

I have a string that I am trying to extract patterns from. The string is as follows:

(  ELT2N ( ELTOK wpSA910 wpSA909 wpSA908 wpSA474 ) )

The problem is, I don't know how many of the strings beginning with 'wp' will be in the string I am searching, but I want to extract all of them using one statement. I am currently using the pattern below:

private final static String STARS_LINE_PATTERN = "\\(\\s+?(\\w+?)\\s+?\\(\\s+(\\w+)\\s+?(\\w+?\\s??){1,}\\s+?\\)\\s+?\\)";

The pattern is matching the string and returning the 'ELT2N' and the 'ELTOK' strings but is not returning the strings prefixed by 'wp'.

Can anyone help?

Thanks

Simon

A: 

MvanGeest's comment is correct: if you use a quantifier on a capture group, only the last value is stored. Put simply, if you do not know how many 'sets' there are, the whole job cannot be done in a single step. You would first have to match all of the 'wp'-prefixed strings into a single group, so that you have "ELT2N", "ELTOK", "wpSA910 wpSA909 wpSA908 wpSA474", and then parse the last string separately to split out the individual values. I've not used Java in years, and never Java regex, so I can't tell you the exact steps, but using the pattern...

private final static String STARS_LINE_PATTERN = "\\(\\s+?(\\w+?)\\s+?\\(\\s+(\\w+)\\s+?((?:\\w+?\\s??){1,})\\s+?\\)\\s+?\\)";

...should split the string initially. In PHP I'd just use explode to split \3 into an array of the individual values; I'm sure you have something similar available.
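For what it's worth, a minimal Java sketch of those two steps might look like the following. The class and method names (RepeatedGroupDemo, lastCapture, extractTokens) are made up for illustration. The small pattern in lastCapture is a simplified stand-in used only to show the "last value only" behaviour, and String.split is assumed to be an acceptable substitute for PHP's explode.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Pattern from the answer above: group 3 wraps the whole repeated part.
    static final Pattern LINE = Pattern.compile(
        "\\(\\s+?(\\w+?)\\s+?\\(\\s+(\\w+)\\s+?((?:\\w+?\\s??){1,})\\s+?\\)\\s+?\\)");

    // Illustration only: a capturing group inside a repetition keeps
    // just the text of its final iteration.
    static String lastCapture(String words) {
        Matcher m = Pattern.compile("(?:(\\w+)\\s*)+").matcher(words);
        return m.matches() ? m.group(1) : null;
    }

    // Step 1: capture the whole run of tokens as one string (group 3).
    // Step 2: split that string on whitespace.
    static String[] extractTokens(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return new String[0];
        }
        return m.group(3).trim().split("\\s+");
    }

    public static void main(String[] args) {
        String line = "(  ELT2N ( ELTOK wpSA910 wpSA909 wpSA908 wpSA474 ) )";
        System.out.println(lastCapture("wpSA910 wpSA909 wpSA908 wpSA474"));
        System.out.println(Arrays.toString(extractTokens(line)));
    }
}
```

On the sample line, lastCapture should return only "wpSA474", while extractTokens should return all four tokens.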

Cags
A: 

How about String#split(" wp")? Drop the first result, and you will need to fudge the last, but it will do the job.
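A rough sketch of what that could look like, assuming every token really does start with "wp" (which the asker later says is not guaranteed). The class and method names are hypothetical, and the way the last piece is "fudged" here, by cutting at the first whitespace or closing parenthesis, is just one possible choice.

```java
import java.util.ArrayList;
import java.util.List;

class SplitOnPrefixDemo {
    static List<String> extract(String line) {
        // The delimiter " wp" is consumed by split, so it is re-added below.
        String[] parts = line.split(" wp");
        List<String> result = new ArrayList<>();
        // Start at 1 to drop the first piece ("(  ELT2N ( ELTOK").
        for (int i = 1; i < parts.length; i++) {
            // Fudge: cut everything from the first whitespace or ')' onwards,
            // which strips the trailing " ) )" from the last piece.
            String token = parts[i].replaceAll("[\\s)].*$", "");
            result.add("wp" + token);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("(  ELT2N ( ELTOK wpSA910 wpSA909 wpSA908 wpSA474 ) )"));
    }
}
```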

Tassos Bassoukos
A: 

It would be easier to do it without regex at all, like this:

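String input = "(  ELT2N ( ELTOK wpSA910 wpSA909 wpSA908 wpSA474 ) )";
// String.split needs a regex argument; split on runs of whitespace.
String[] tokens = input.trim().split("\\s+");
String result = "";
for (int i = 0; i < tokens.length; i++) {
  // Keep only the tokens that start with "wp".
  if (tokens[i].startsWith("wp")) {
    result += tokens[i] + " ";
  }
}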
Adam Schmideg
Sorry, I love the idea, but the strings may not always be prefixed with 'wp'. They could be anything: numbers, text, etc.
TotalCruise
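If the prefix cannot be relied on, one possible regex-free variant is to select tokens by position rather than by prefix. This is only a sketch under two assumptions taken from the sample line: the parentheses are always separated from the words by whitespace, and the first two words are always the labels (ELT2N and ELTOK in the example). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizeWithoutPrefix {
    static List<String> trailingTokens(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            // Assumption: parentheses appear as separate tokens.
            if (!token.equals("(") && !token.equals(")")) {
                words.add(token);
            }
        }
        // Assumption: the first two words are the labels; the rest are the values.
        if (words.size() <= 2) {
            return new ArrayList<>();
        }
        return new ArrayList<>(words.subList(2, words.size()));
    }

    public static void main(String[] args) {
        System.out.println(trailingTokens("(  ELT2N ( ELTOK wpSA910 42 foo wpSA474 ) )"));
    }
}
```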
A: 

Java regex, like most flavors, can only keep the last capture when you repeat a capturing group.

For this particular problem, you may want to match the entire wp sequence into one group in one regex, and then post-process it again with another regex. In this case, a simple split is enough.

Here's a snippet to illustrate the idea:

    import java.util.regex.*;
    import java.util.*;
    //...

    String text = "(  ELT2N ( ELTOK wpSA910 wpSA909 wpSA908 wpSA474 ) )";
    String regex =
        "< (word) < (word) ((?:word )+)> >"
            .replace(" ", "\\s+")
            .replace("<", "\\(")
            .replace(">", "\\)")
            .replace("word", "\\w+");

    Matcher m = Pattern.compile(regex).matcher(text);
    if (m.find()) {
        System.out.printf("%s; %s;%n%s",
            m.group(1),
            m.group(2),
            Arrays.toString(m.group(3).split("\\s+"))
        );
    }

The above prints:

ELT2N; ELTOK;
[wpSA910, wpSA909, wpSA908, wpSA474]

So the entire wp sequence is captured by \3 of the regex pattern, which is then split into its parts.

polygenelubricants
Very nice, thank you
TotalCruise