I am processing text using Java regexes (Java 1.6) that contain quantifiers, and I wish to return the number and values of the matched groups. A simple example is:

A BC DEF 1 23 456 7 XY Z

which is matched by:

([A-Z]+){0,9} (\d+){0,9} ([A-Z]+){0,9}

How can I find the number of captures for each group (here 3, 4 and 2) and their values ("A", "BC", "DEF", "1", "23", "456", "7", "XY", "Z")? The regexes are created outside the program, though I can design them to tackle this problem if possible.

+2  A: 

When a capturing group matches more than once, Java keeps only the last capture, so it is not possible to get at all of them. You could redesign your regex like this:

((?:[A-Z]+ ?){0,9}) ((?:\d+ ?){0,9}) ((?:[A-Z]+ ?){0,9})

which would give you the captures "A BC DEF", "1 23 456 7" and "XY Z", which you could then split on spaces.
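
As a rough illustration of the capture-then-split idea, here is a minimal sketch. The class name SplitCaptures and the method tokensPerGroup are hypothetical names chosen for this example. It assumes the space after each token is optional inside every group (written `\d+ ?` for the digits as well), so that the literal space between groups can still match, and it uses no language features newer than Java 1.6.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitCaptures {
    // Each outer group captures a whole run of tokens; the inner groups are non-capturing.
    // Assumption: the trailing space inside every group is optional.
    static final Pattern RUNS = Pattern.compile(
            "((?:[A-Z]+ ?){0,9}) ((?:\\d+ ?){0,9}) ((?:[A-Z]+ ?){0,9})");

    // Returns one list of tokens per outer group, or null if the input does not match.
    static List<List<String>> tokensPerGroup(String input) {
        Matcher m = RUNS.matcher(input);
        if (!m.matches()) {
            return null;
        }
        List<List<String>> result = new ArrayList<List<String>>();
        for (int g = 1; g <= m.groupCount(); g++) {
            String run = m.group(g);
            List<String> tokens = new ArrayList<String>();
            // "".split(" ") yields one empty string, so skip empty runs explicitly.
            if (run.length() > 0) {
                tokens.addAll(Arrays.asList(run.split(" ")));
            }
            result.add(tokens);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the count and the tokens of each group, one group per line.
        for (List<String> tokens : tokensPerGroup("A BC DEF 1 23 456 7 XY Z")) {
            System.out.println(tokens.size() + " " + tokens);
        }
    }
}
```

For the sample input this gives counts of 3, 4 and 2, with the individual tokens available from each list.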

sepp2k
Thank you. I had thought I might have to do this, and it's useful to have it confirmed. Since I have already parsed the regex into potential capture groups, I can use them to parse the larger captures.
peter.murray.rust
+1  A: 

If you use a quantifier on a capturing group, the group will only return the last repetition that matched. By that I mean, for:

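import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "a ab abc";
// Backslashes must be doubled in a Java string literal; " ?" lets each repetition consume the separator.
Pattern p = Pattern.compile("(\\w+ ?){3}");
Matcher m = p.matcher(s);
if (m.matches()) {
  // m.group(1) will equal "abc": only the last repetition is kept
}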

The alternative in your case is to do something like this:

String s = "A BC DEF 1 23 456 7 XY Z";
Pattern p = Pattern.compile("([A-Z]+|\\d+)");
Matcher m = p.matcher(s);
while (m.find()) {
  // each call to find() yields the next token: A, BC, DEF, 1, 23, ...
  System.out.println(m.group(1));
}

I realize that doesn't have quite the same semantics as your regex (it doesn't enforce the order of letter groups and number groups), but it's a start. You can implement that kind of state checking yourself if you wish.
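
One possible shape for that state checking, as a sketch only: walk the tokens with find() and start a new run whenever the token kind switches between letters and digits. The names TokenRuns and runs are made up for this example, and the code does not validate that the input follows the letters, digits, letters layout of the original regex. It simply groups whatever tokens it finds and ignores everything else.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenRuns {
    // Group 1 is set for a letter token, group 2 for a digit token.
    static final Pattern TOKEN = Pattern.compile("([A-Z]+)|(\\d+)");

    // Collects consecutive tokens of the same kind (letters or digits) into runs.
    static List<List<String>> runs(String input) {
        List<List<String>> runs = new ArrayList<List<String>>();
        Matcher m = TOKEN.matcher(input);
        boolean previousWasLetters = false;
        while (m.find()) {
            boolean isLetters = m.group(1) != null;
            // Open a new run at the first token and whenever the kind changes.
            if (runs.isEmpty() || isLetters != previousWasLetters) {
                runs.add(new ArrayList<String>());
            }
            runs.get(runs.size() - 1).add(m.group());
            previousWasLetters = isLetters;
        }
        return runs;
    }

    public static void main(String[] args) {
        // Prints the size and the tokens of each run, one run per line.
        for (List<String> run : runs("A BC DEF 1 23 456 7 XY Z")) {
            System.out.println(run.size() + " " + run);
        }
    }
}
```

The size of each inner list is the number of captures for that run, so the sample input yields runs of 3, 4 and 2 tokens.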

cletus