ansaurus

Question

How can I extract a variable number of sub-matches from a Ruby regex?

Answer 1

+1 A:

This is what I managed to do:

([+-]?\d?)(C|P)(?=(?:[+-]?\d?[CP])*$)

This way you capture multiple elements.
The only problem is the validity of the string. Ruby 1.8 doesn't have look-behind (Ruby 1.9 and later do, but not for a variable-length pattern such as this one), so I can't check the start of the string. As a result, zerhyju+2P-C-3P is accepted (but only +2P-C-3P is captured), whereas +2P-C-3Pzertyuio is rejected.
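
A quick way to see this behaviour is String#scan, which returns one array of captures per match. This is only an illustrative sketch: the constant name TERM is hypothetical, and the two inputs are the strings mentioned above.

```ruby
# TERM is a hypothetical name for the look-ahead regex shown above.
TERM = /([+-]?\d?)(C|P)(?=(?:[+-]?\d?[CP])*$)/

# Leading junk is skipped, so the trailing terms are still captured.
leading_junk = "zerhyju+2P-C-3P".scan(TERM)
# => [["+2", "P"], ["-", "C"], ["-3", "P"]]

# Trailing junk makes the look-ahead fail at every position, so nothing matches.
trailing_junk = "+2P-C-3Pzertyuio".scan(TERM)
# => []
```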

If you want to both capture and check that your string is valid, the best way (IMO) is to use two regexes: one to validate, ^(?:[+-]?\d?[CP])*$, and a second one to capture, ([+-]?\d?)(C|P). (You could also use ([CP]) for the last part.)
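
A minimal sketch of the two-regex approach, assuming the field is a single-line string. The names VALID, TERM and extract_terms are hypothetical, and the capture regex uses the ([CP]) variant.

```ruby
# Hypothetical names; the patterns are the two regexes described above.
VALID = /^(?:[+-]?\d?[CP])*$/
TERM  = /([+-]?\d?)([CP])/

# Returns an array of [prefix, letter] pairs, or nil if the field is invalid.
def extract_terms(field)
  return nil unless field =~ VALID  # first regex: validate the whole string
  field.scan(TERM)                  # second regex: capture every term
end

good = extract_terms("+2P-C-3P")        # => [["+2", "P"], ["-", "C"], ["-3", "P"]]
bad  = extract_terms("zerhyju+2P-C-3P") # => nil
```

In Ruby, ^ and $ anchor at line boundaries, so for input that may contain newlines \A and \z would be the stricter choice.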

Colin Hebert 2010-10-03 08:04:33

I just tried this and it seems to work on the rubular site. Let me review your answer. What does "?=" do?

Leon Adeoye 2010-10-03 08:16:07

+1 using two regexes is the convenient alternative to the split-and-loop suggestion I made. Not sure why you use `(?:C|P)` in place of `[CP]` though.

Tomalak 2010-10-03 08:16:14

@Tomalak, I didn't really think about it. But you're right, it's clearer with `[CP]` (and the answer is updated accordingly).

Colin Hebert 2010-10-03 08:29:57

@Leon Adeoye, `?=` is a positive look-ahead: it checks that the following pattern matches without consuming or "capturing" those characters. You can read more at http://www.regular-expressions.info/lookaround.html. But as said in previous comments and at the end of the answer, if you want to both match and capture, two regexes are a better way.
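
A small illustration of what the look-ahead does, using made-up strings ("foobar", "foobaz") that are not from the question:

```ruby
# The look-ahead (?=bar) must succeed, but "bar" is not part of the match.
with_bar    = "foobar"[/foo(?=bar)/]   # => "foo"
without_bar = "foobaz"[/foo(?=bar)/]   # => nil

# Because "bar" was not consumed, the next match attempt can still see it.
both = "foobar".scan(/foo(?=bar)|bar/) # => ["foo", "bar"]
```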

Colin Hebert 2010-10-03 08:33:03

Leon Adeoye 2010-10-03 14:10:43

when /^(?:[+-]?\d?[CP])*$/.match(field); while(!field.nil?); puts "#{b.to_a.inspect}"; puts field = field.sub!(/([+-]?\d?)([CP])/, ""); end

Leon Adeoye 2010-10-03 14:11:32

Answer 2

+2 A:

You have a profound (but common) misunderstanding of how character classes work. This:

[C|P]

is wrong, unless you want to match pipe | characters. There is no alternation in character classes; they are not like groups. This would be correct:

[CP]
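
The difference is easy to check in irb. A short sketch (the variable names are only for illustration):

```ruby
# Inside [...] the pipe is a literal character, not alternation.
pipe_in_class  = "|" =~ /[C|P]/   # => 0 (the class matches the pipe itself)
pipe_not_class = "|" =~ /[CP]/    # => nil

# Both classes still accept the intended letters.
c_ok = "C" =~ /[CP]/              # => 0
p_ok = "P" =~ /[CP]/              # => 0
```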

Also, most meta-characters lose their special meaning inside a character class, so you only need to escape very few characters (notably the closing square bracket ], the backslash \, and the dash -, unless you put the dash at the end of the class). So your regex reduces to:

^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$

Your second misunderstanding is that group count is dynamic - that you somehow have more groups in the result because more matches occurred in the string. This is not the case.

You have exactly as many groups in your result as you have parentheses pairs in your regex (less the number of non-capturing groups of course). In this case, that number is 4. No more, no less.

If a group matches multiple times, only the contents of the last match occurrence will be retained. There is no way (in Ruby) to get the contents of previous match occurrences for that group.
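
For illustration, here is what the reduced regex returns for a three-term string such as +2P-C-3P (an example input, not taken from the question):

```ruby
re = /^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$/
m  = re.match("+2P-C-3P")

captures = m.captures  # => ["+2", "P", "-3", "P"]
captures.length        # => 4, however many terms the string contains
# The middle term "-C" was matched by groups 3 and 4 on the first pass of the
# repeated group, then overwritten by "-3" and "P" on the second pass.
```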

As an alternative, you could regex-split the string into its meaningful parts and then parse them in a loop to extract all info.
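
One possible reading of that suggestion, as a rough sketch: split on the letters while keeping them, then walk the pieces in pairs. The helper name split_terms and the hash keys are hypothetical, and the sketch does no validation of the input.

```ruby
# Hypothetical helper: splits a field such as "+2P-C-3P" on the letters,
# keeping them, then pairs each prefix with the letter that follows it.
def split_terms(field)
  # A capturing group in the separator makes String#split keep the letters:
  # "+2P-C-3P" => ["+2", "P", "-", "C", "-3", "P"]
  parts = field.split(/([CP])/)
  parts.each_slice(2).map do |prefix, letter|
    { prefix: prefix, letter: letter }
  end
end

terms = split_terms("+2P-C-3P")
terms.each { |t| puts "#{t[:prefix].inspect} #{t[:letter]}" }
```

Unlike a single match, the number of entries in terms grows with the number of terms in the string.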

Tomalak 2010-10-03 08:05:27

@Tomalak very informative.

typoknig 2010-10-03 08:09:43

Thanks for your input, Tomalak; let me review your response. I'm new to pattern matching, so forgive my [C|P] misunderstanding.

Leon Adeoye 2010-10-03 08:18:43
