You seem to be misunderstanding how character classes (e.g. [a-f0-9] or [^aeiouy]) work. /[^abcd]/ doesn't negate the pattern abcd; it says "match any single character that's not 'a', 'b', 'c', or 'd'".
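To make the difference concrete, here is a small sketch (the sample strings are made up for illustration) showing that a negated character class tests one character at a time rather than rejecting the sequence "abcd":

```ruby
# [^abcd] is a single-character test, not a negated "abcd" pattern.
negated_class = /[^abcd]/

all_excluded = "dcba"   # every character is one of a, b, c, d
one_allowed  = "abxd"   # contains "x", which is outside the class

no_match    = all_excluded[negated_class]  # nil: no character qualifies
first_match = one_allowed[negated_class]   # "x": first character outside the class

# A string that contains "abcd" still matches, because of the "!".
contains_abcd_but_matches = "abcd!".match?(negated_class)

p no_match, first_match, contains_abcd_but_matches
```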
If you want to match the negation of a pattern, use the /(?!pattern)/ construct (a negative lookahead). It's a zero-width match, meaning it doesn't consume any characters; it matches a position. This is similar to how /^/ and /$/ match the start and end of a line (in Ruby, /\A/ and /\z/ anchor to the whole string), or how /\b/ matches a word boundary. For instance, /(?!xx)/ matches every position where the pattern "xx" doesn't start.
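As a rough illustration of "matching a position", the sketch below lists the positions where /(?!xx)/ succeeds in a made-up string, "axxb". The only position excluded is index 1, where "xx" starts; note that the end of the string counts as a position too.

```ruby
text = "axxb"

# Record the index of every position where "xx" does NOT start.
positions = []
text.scan(/(?!xx)/) { positions << Regexp.last_match.begin(0) }
p positions  # => [0, 2, 3, 4]

# Each match is empty, so a substitution only inserts a marker at those positions.
marked = text.gsub(/(?!xx)/, "|")
p marked     # => "|ax|x|b|"
```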
In general then, after you use a pattern negation, you need to match some character to move forward in the string.
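A minimal sketch of that idea, using "xx" as a stand-in for the pattern to avoid: pairing the lookahead with a single `.` consumes one character at a time for as long as "xx" doesn't start at the current position.

```ruby
text = "abxxcd"

# One or more characters, each of which is not the start of "xx".
not_xx = /(?:(?!xx).)+/

first_run = text[not_xx]      # stops right before "xx"
all_runs  = text.scan(not_xx) # the second run starts at the second "x"

p first_run  # => "ab"
p all_runs   # => ["ab", "xcd"]
```

The second run begins with "x" because, at that position, the next two characters are "xc" rather than "xx".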
So to use your pattern:
form = /\/\*\s+(\w+)\s+\((\d{4})\)[0]{2}\s+\*\//m
form_and_fields = /#{form}((?:(?!#{form}).)+)/m
text.scan(form_and_fields)
From the inside out (I'll be using (?#comments)):
(?!#{form}) negates your original pattern, so it matches any position where your original pattern can't start.
(?:(?!#{form}).)+ means match one character after that, and try again, as many times as possible, but at least once. (?:(?#whatever)) is a non-capturing group - good for grouping.
In irb, this gives:
irb> text.scan(form_and_fields)
=> [["F004", "0309", " \n /* field 1 */ \n /* field 2 */ \n ", nil, nil], ["F004", "0409", " \n /* field 1 */ \n /* field 2 */ \n", nil, nil]]
The extra nils come from the capturing groups in form that are reused inside the negated pattern (?!#{form}). A negative lookahead only succeeds when its contents fail to match, so those groups never capture anything on a successful match.
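The same effect can be reproduced on a toy pattern (the string and groups here are illustrative only): a capturing group inside a negative lookahead still gets a slot in the results, but that slot is always nil when the overall match succeeds.

```ruby
# Group 2 lives inside the negative lookahead, so it can never hold a value
# when the whole pattern matches.
m = "ab".match(/(a)(?!(x))(b)/)
captures = m.captures

p captures  # => ["a", nil, "b"]
```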
The form_and_fields pattern can be cleaned up a bit:
form_and_fields = /#{form}\s*(.+?)\s*(?:(?=#{form})|\Z)/m
text.scan(form_and_fields)
Now, instead of a zero-width negative lookahead, we use a zero-width positive lookahead, (?=#{form}), to match the position of the next occurrence of form. So in this regex, we lazily match everything up to the next occurrence of form (without including that next occurrence in our match). This lets us trim the whitespace around the fields. We also have to handle the case where we hit the end of the string, /\Z/, since the last form has nothing after it to look ahead to.
In irb:
irb> text.scan(form_and_fields)
=> [["F004", "0309", "/* field 1 */ \n /* field 2 */", "F004", "0409"], ["F004", "0409", "/* field 1 */ \n /* field 2 */", nil, nil]]
Note that the last two fields are now populated in the first result, because the capturing parens inside the zero-width positive lookahead matched something. That text wasn't consumed by the match, which is why it could be matched again as the start of the second result.
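If you'd rather not get those extra captures at all, one option is to use a capture-free copy of the pattern inside the lookahead. This is only a sketch: form_nc is a hypothetical name, and the sample text is an assumption reconstructed from the output above, not your actual input.

```ruby
form = /\/\*\s+(\w+)\s+\((\d{4})\)[0]{2}\s+\*\//m

# Hypothetical helper: the same pattern as form, written without capturing groups.
form_nc = /\/\*\s+\w+\s+\(\d{4}\)[0]{2}\s+\*\//m

# Capture with form, but look ahead with the capture-free copy.
form_and_fields = /#{form}\s*(.+?)\s*(?:(?=#{form_nc})|\Z)/m

# Assumed sample input, shaped like the data in the question.
text = "/* F004 (0309)00 */\n/* field 1 */\n/* field 2 */\n" \
       "/* F004 (0409)00 */\n/* field 1 */\n/* field 2 */\n"

results = text.scan(form_and_fields)
p results
# => [["F004", "0309", "/* field 1 */\n/* field 2 */"],
#     ["F004", "0409", "/* field 1 */\n/* field 2 */"]]
```

Each result now has exactly three elements: the two captures from form and the block of fields. The trade-off is that the pattern is written twice, so the two copies have to be kept in sync by hand.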