I'm trying to parse lines of the form:

command arg1[ arg2, ... argn]

such as:

usemtl weasels

or

f 1/2/3 4/5/6 7/8/9

Here is my regex:

^(\\w+)(( \\S+)+)$

When I parse the line "usemtl weasels", I get the following capture groups:

Match 0: 'usemtl weasels'
Match 1: 'usemtl'
Match 2: ' weasels'

Why the space before the second match group? It doesn't show up in Rubular.

+1  A: 

Grouping in Java regex is a little strange. Group 0 gives you the complete match of your regex; this is the same in all regex implementations I know. But group n (for n >= 1) gives you the last match of the nth declared group, not the nth match found.

Your second match gives you ' weasels' with a leading blank because your pattern contains that blank inside the group. The second declared group is (( \\S+)+), and everything it matches starts with the blank, so group 2 starts with it too.
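To make this concrete, here is a minimal sketch that compares the pattern from the question with a variant that keeps the first blank outside the capturing group. The variant pattern and the names LeadingBlankDemo, VARIANT and argumentGroup are my own assumptions for illustration, not something from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingBlankDemo {
    // Pattern from the question: the blank sits inside groups 2 and 3.
    static final Pattern ORIGINAL = Pattern.compile("^(\\w+)(( \\S+)+)$");

    // Hypothetical variant: the first blank is matched outside group 2,
    // and the repeated part is a non-capturing group.
    static final Pattern VARIANT = Pattern.compile("^(\\w+) (\\S+(?: \\S+)*)$");

    // Returns group 2 of the given pattern for the line, or null if the line does not match.
    static String argumentGroup(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Brackets make the leading blank visible in the output.
        System.out.println("[" + argumentGroup(ORIGINAL, "usemtl weasels") + "]"); // [ weasels]
        System.out.println("[" + argumentGroup(VARIANT, "usemtl weasels") + "]");  // [weasels]
    }
}
```

With the variant, group 2 of "f 1/2/3 4/5/6 7/8/9" is '1/2/3 4/5/6 7/8/9', still as one string.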

If you apply your pattern to the string 'a b c d', group 0 will be 'a b c d', group 1 will be 'a', group 2 will be ' b c d' and group 3 will be ' d', because this is the last match of your 3rd declared (inner) group ( \\S+). Groups 2 and 3 both keep the leading blank.
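The group numbering can be checked with a short sketch like the one below. It also shows one possible way to get at every argument, by splitting group 2 on blanks. The class and method names (RepeatedGroupDemo, groups, arguments) are made up for this example, and splitting is just one option I am assuming fits the input format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Pattern from the question, unchanged.
    static final Pattern LINE = Pattern.compile("^(\\w+)(( \\S+)+)$");

    // Returns groups 0..groupCount() for the line, or an empty list if it does not match.
    static List<String> groups(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    // Recovers every argument by trimming group 2 and splitting it on single blanks.
    static List<String> arguments(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Collections.emptyList();
        }
        return Arrays.asList(m.group(2).trim().split(" "));
    }

    public static void main(String[] args) {
        // Group 3 only holds the last repetition of the inner group.
        for (String g : groups("a b c d")) {
            System.out.println("[" + g + "]");
        }
        System.out.println(arguments("f 1/2/3 4/5/6 7/8/9"));
    }
}
```

Here groups("a b c d") yields 'a b c d', 'a', ' b c d' and ' d', in that order.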

tangens
Hm. From my perspective, *"the last match of the n'th declared group"* is the only logical thing. What regex engine gives you the n'th match found? This makes no sense at all.
Tomalak
OK, perhaps I was the only one confused by this. I expected to be able to reference all matches even if a group matched multiple times.
tangens
Following your logic, would that mean that in `(a)*(b)` the `b` would be represented by different numbers depending on how often `a` matched? That's just not right. ;-) The .NET framework [supports `CaptureCollection`](http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.capturecollection(v=VS.90\).aspx) which lets you do this kind of thing. However, that's a rather unusual feature with regex engines.
Tomalak