I have some strings that I would like to pattern match and then extract out the matches as variables $1, $2, etc.

The pattern matching code I have is

a = /^([\+|\-]?[1-9]?)([C|P])(?:([\+|\-][1-9]?)([C|P]))*$/i.match(field)

puts "result = #{a.to_a.inspect}"

With the above I am able to easily match the following sample strings:

"C", "+2C", "2c-P", "2C-3P", "P+C"

And I have confirmed all of these work on the Rubular website.
However, when I try to match "+2P-c-3p", the match succeeds, but the MatchData "array-like object" looks like this:

result = ["+2P-C-3P", "+2", "P", "-3", "P"]

The problem is that I am unable to extract the middle pattern "-C" into the array.

What I would expect to see is:

result = ["+2P-C-3P", "+2", "P", "-", "C", "-3", "P"]

It seems to extract only the end part "-3P", as "-3" and "P".

Does anyone know how I can modify my pattern to capture the middle matches?
As another example, for +3c+2p-c-4p I would expect:

["+3c+2p-c-4p", "+3", "C", "+2", "P", "-", "C", "-4", "P"]

but what I get is

["+3c+2p-c-4p", "+3", "C", "-4", "P"]

which completely misses the middle part.

+1  A: 

This is what I managed to do :

([+-]?\d?)(C|P)(?=(?:[+-]?\d?[CP])*$)

This way you capture multiple elements.
The only problem is validating the string. As Ruby 1.8 doesn't have look-behind (Ruby 1.9 and later do), I can't check the start of the string, so zerhyju+2P-C-3P is accepted (but only +2P-C-3P is captured), whereas +2P-C-3Pzertyuio is rejected.
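To see that behaviour concretely, here is a small sketch using `String#scan` with the pattern above. The constant name `LOOKAHEAD_TERM` is only for illustration, and the inputs are the sample strings from this answer.

```ruby
# Sketch only: the pattern is the one from the answer, the constant name is made up.
LOOKAHEAD_TERM = /([+-]?\d?)(C|P)(?=(?:[+-]?\d?[CP])*$)/

# Every term is captured, because scan restarts the pattern after each match.
p "+2P-C-3P".scan(LOOKAHEAD_TERM)          # [["+2", "P"], ["-", "C"], ["-3", "P"]]

# Leading junk is silently skipped, so this input gives the same three pairs.
p "zerhyju+2P-C-3P".scan(LOOKAHEAD_TERM)   # [["+2", "P"], ["-", "C"], ["-3", "P"]]

# Trailing junk makes the look-ahead fail for every term.
p "+2P-C-3Pzertyuio".scan(LOOKAHEAD_TERM)  # []
```

Note that the pattern has no `i` flag here, so lowercase input such as "+2p-c" would need `/i` added.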

If you want to both capture and check if your string is valid, the best way (IMO) is to use two regexes, one to check the value ^(?:[+-]?\d?[CP])*$ and a second one to capture ([+-]?\d?)(C|P) (You could also use ([CP]) for the last part).
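A minimal sketch of the two-regex approach might look like the following. The method name `extract_terms` and the constant names are hypothetical, and the `i` flag is an assumption carried over from the question's pattern.

```ruby
# Hypothetical helper: validate the whole string first, then capture each term.
VALID_TERMS = /^(?:[+-]?\d?[CP])*$/i   # validation regex from the answer, plus /i
TERM        = /([+-]?\d?)([CP])/i      # capture regex from the answer, plus /i

def extract_terms(field)
  # Reject the string outright if it is not made up entirely of terms.
  return nil unless field =~ VALID_TERMS
  # scan returns one [coefficient, letter] pair per term.
  field.scan(TERM)
end

p extract_terms("+3c+2p-c-4p")       # [["+3", "c"], ["+2", "p"], ["-", "c"], ["-4", "p"]]
p extract_terms("+2P-C-3Pzertyuio")  # nil
p extract_terms("zerhyju+2P-C-3P")   # nil
```

The captures keep the case of the input. Call `upcase` on them if you want "C" and "P" as in the question.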

Colin Hebert
I just tried this and it seems to work on the rubular site. Let me review your answer. What does "?=" do?
Leon Adeoye
+1 using two regexes is the convenient alternative to the split-and-loop suggestion I made. Not sure why you use `(?:C|P)` in place of `[CP]` though.
Tomalak
@Tomalak, I didn't really think about it. But you're right, it's clearer with `[CP]` (and the answer is now updated)
Colin Hebert
@Leon Adeoye, `?=` is a positive look-ahead: it checks that the following pattern matches without "capturing" or consuming the elements. You can read more at http://www.regular-expressions.info/lookaround.html. But as said in previous comments and at the end of the answer, if you want to both validate and capture, two regexes is a better way.
Colin Hebert
Leon Adeoye
if /^(?:[+-]?\d?[CP])*$/.match(field); while (b = /([+-]?\d?)([CP])/.match(field)); puts b.to_a.inspect; field = field.sub(/([+-]?\d?)([CP])/, ""); end; end
Leon Adeoye
+2  A: 

You have a profound (but common) misunderstanding of how character classes work. This:

[C|P]

is wrong, unless you want to match pipe | characters. There is no alternation in character classes - they are not like groups. This would be correct:

[CP]
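A quick check in irb shows the difference. This is just an illustration, with test strings made up for the purpose:

```ruby
# [C|P] is a set of three characters: "C", "|" and "P".
p "|" =~ /\A[C|P]\z/        # 0   (the pipe itself matches)
p "|" =~ /\A[CP]\z/         # nil (only "C" or "P" match)

# The same applies to the sign class from the question.
p "+|-".scan(/[\+|\-]/)     # ["+", "|", "-"]
p "+|-".scan(/[+-]/)        # ["+", "-"]
```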

Also, almost nothing is a meta-character inside a character class, so you only need to escape very few characters (namely the backslash \, the closing square bracket ], a caret ^ in first position, and the dash -, unless you put the dash at the start or end of the class). So your regex reduces to:

^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$

Your second misunderstanding is that group count is dynamic - that you somehow have more groups in the result because more matches occurred in the string. This is not the case.

You have exactly as many groups in your result as you have parentheses pairs in your regex (less the number of non-capturing groups of course). In this case, that number is 4. No more, no less.

If a group matches multiple times, only the contents of the last match occurrence will be retained. There is no way (in Ruby) to get the contents of previous match occurrences for that group.
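You can see both points with the simplified regex from above. The constant name `REPEATED_GROUPS` and the `i` flag (taken from the question) are the only additions in this sketch:

```ruby
# Simplified pattern from this answer, with the question's /i flag.
REPEATED_GROUPS = /^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$/i

m = REPEATED_GROUPS.match("+3c+2p-c-4p")

# Four capturing groups in the pattern, so four captures, however often the
# non-capturing group repeats.
p m.captures.size   # 4

# Groups 3 and 4 hold only the last repetition ("-4p").
p m.to_a            # ["+3c+2p-c-4p", "+3", "c", "-4", "p"]
```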

As an alternative, you could regex-split the string into its meaningful parts and then parse them in a loop to extract all info.
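One possible way to do that, as a rough sketch: split in front of every sign with a look-ahead, then match each piece on its own. The method name `parse_terms` is hypothetical, and the sketch assumes every term after the first starts with + or -, as in the question's samples.

```ruby
# Hypothetical helper, not part of the original answer.
def parse_terms(field)
  # Split in front of each sign, keeping the sign with the term that follows it.
  field.split(/(?=[+-])/).map do |part|
    m = /\A([+-]?\d?)([CP])\z/i.match(part)
    raise ArgumentError, "invalid term: #{part.inspect}" unless m
    [m[1], m[2].upcase]
  end
end

p parse_terms("+3c+2p-c-4p")  # [["+3", "C"], ["+2", "P"], ["-", "C"], ["-4", "P"]]
p parse_terms("2c-P")         # [["2", "C"], ["-", "P"]]
```

Each piece is validated as it is parsed, so a malformed string raises instead of being partially captured.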

Tomalak
@Tomalak very informative.
typoknig
Thanks for your input Tomalak, let me review your response. I'm new to pattern matching, so forgive my [C|P] misunderstanding.
Leon Adeoye