Given this text:

    /* F004 (0309)00 */  
    /* field 1 */  
    /* field 2 */  
    /* F004 (0409)00 */  
    /* field 1 */  
    /* field 2 */  

how do I parse it into this array:
[
["F004"],["0309"],["/* field 1 */\n/* field 2 */"],
["F004"],["0409"],["/* field 1 */\n/* field 2 */"]
]

I got code working to parse the first two items:

form = /\/\*\s+(\w+)\s+\((\d{4})\)[0]{2}\s+\*\//m
text.scan(form)

[
["F004"],["0309"],
["F004"],["0409"]
]

And here's the code where I try to parse all three and fail with an invalid regex error:

form = /\/\*\s+(\w+)\s+\((\d{4})\)[0]{2}\s+\*\//m
form_and_fields = /#{form}(.[^#{form}]+)/m
text.scan(form_and_fields)


edit: This is what ended up working for me, thanks to both rampion and singpolyma:

form = /
  \/\*\s+(\w+)\s+\((\d+)\)\d+\s+\*\/    #formId & edDate
  (.+?)                                 #fieldText
  (?=\/\*\s+\w+\s+\(\d+\)\d+\s+\*\/|\Z) #stop at beginning of next form
                                        # or the end of the string
/mx
text.scan(form)
+2  A: 
a.scan(/\/\*\s+(\S+)\s+\((\d+)\)\d+\s+\*\/\s+(\/\*.+\*\/\s+\n\s+\/\*.+\*\/)/)
=> [["F004", "0309", "/* field 1 */  \n    /* field 2 */"], ["F004", "0409", "/* field 1 */  \n    /* field 2 */"]]
singpolyma
+2  A: 

You seem to be misunderstanding how character classes (e.g. [a-f0-9], or [^aeiouy]) work. /[^abcd]/ doesn't negate the pattern abcd, it says "match any character that's not 'a' or 'b' or 'c' or 'd'".
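To make the character-class point concrete, here is a small Ruby sketch. The sample strings are made up for illustration and are not from the question.

```ruby
# A negated character class excludes single characters, not a sequence.
text = "xaby"  # made-up sample string

# [^ab] matches any ONE character other than "a" or "b".
single_chars = text.scan(/[^ab]/)

# It does not mean "anything other than the substring ab":
# "ba" contains no "ab", yet [^ab]+ still cannot match any part of it,
# because every character in "ba" is individually excluded.
no_match = "ba" =~ /[^ab]+/

p single_chars
p no_match
```

Here single_chars holds only "x" and "y", and no_match is nil.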

If you want to match the negation of a pattern, use the /(?!pattern)/ construct (a negative lookahead). It's a zero-width match, meaning it doesn't consume any characters; it matches a position. This is similar to how /^/ and /$/ match the start and end of a line (in Ruby, /\A/ and /\z/ are the start and end of the string), or how /\b/ matches a word boundary. For instance: /(?!xx)/ matches every position where the pattern "xx" doesn't start.

In general then, after you use a pattern negation, you need to match some character to move forward in the string.
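A short sketch of both ideas, the zero-width lookahead and the need to consume a character after it. The sample strings and variable names are made up for illustration.

```ruby
s = "axxb"  # made-up sample string

# (?!xx) consumes nothing; it only succeeds at offsets where "xx" does
# not start, so every match is an empty string at some position.
positions = []
s.scan(/(?!xx)/) { positions << Regexp.last_match.begin(0) }

# Pairing the lookahead with "." consumes one character per step and
# stops just before the excluded pattern would begin.
before_marker = "ab<end>cd"[/(?:(?!<end>).)+/]

p positions
p before_marker
```

"xx" starts at offset 1, so positions contains every other offset (0, 2, 3 and 4, the last being the end of the string), and before_marker is "ab".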

So to use your pattern:

form = /\/\*\s+(\w+)\s+\((\d{4})\)[0]{2}\s+\*\//m
form_and_fields = /#{form}((?:(?!#{form}).)+)/m
text.scan(form_and_fields)

From the inside out (I'll be using (?#comments))

  • (?!#{form}) negates your original pattern, so it matches any position where your original pattern can't start.
  • (?:(?!#{form}).)+ means match one character after that, and try again, as many times as possible, but at least once. (?:(?#whatever)) is a non-capturing group, which is good for grouping without adding a capture.

In irb, this gives:

irb> text.scan(form_and_fields)
=> [["F004", "0309", "  \n    /* field 1 */  \n    /* field 2 */  \n    ", nil, nil], ["F004", "0409", "  \n    /* field 1 */  \n    /* field 2 */  \n", nil, nil]]

The extra nils come from the capturing groups in form that are used in the negated pattern (?!#{form}) and therefore don't capture anything on a successful match.

This could be cleaned up some:

form_and_fields = /#{form}\s*(.+?)\s*(?:(?=#{form})|\Z)/m
text.scan(form_and_fields)

Now, instead of a zero-width negative lookahead, we use a zero-width positive lookahead (?=#{form}) to match the position of the next occurrence of form. So in this regex, we match everything until the next occurrence of form (without including that next occurrence in our match). This lets us trim out some whitespace around the fields. We also have to check for the case where we hit the end of the string, /\Z/, since that could happen too.

In irb:

irb> text.scan(form_and_fields)
=> [["F004", "0309", "/* field 1 */  \n    /* field 2 */", "F004", "0409"], ["F004", "0409", "/* field 1 */  \n    /* field 2 */", nil, nil]]

Note that the last two fields are now populated in the first result, because the capturing parens inside the zero-width positive lookahead matched something. The lookahead did not consume that text, which is why the same header could be matched again as the start of the second result.
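If the extra trailing captures are unwanted, one option is to use a capture-free copy of the header pattern inside the lookahead. This is only a sketch: the name boundary is hypothetical, and the text is the sample from the question with its indentation and trailing spaces.

```ruby
text = [
  "    /* F004 (0309)00 */  ",
  "    /* field 1 */  ",
  "    /* field 2 */  ",
  "    /* F004 (0409)00 */  ",
  "    /* field 1 */  ",
  "    /* field 2 */  "
].join("\n") + "\n"

form = /\/\*\s+(\w+)\s+\((\d{4})\)[0]{2}\s+\*\//
# Hypothetical helper: the same header pattern without capturing groups,
# so the lookahead adds nothing to each result row.
boundary = /\/\*\s+\w+\s+\(\d{4}\)[0]{2}\s+\*\//

form_and_fields = /#{form}\s*(.+?)\s*(?=#{boundary}|\Z)/m
result = text.scan(form_and_fields)

p result
```

Each row now has exactly three elements: the form id, the four-digit code, and the field text.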

rampion
Good explanation of why his way is failing. I do think that bothering with the whole negated pattern thing is overkill for this (and most) scenarios, though.
singpolyma
Agreed. But I thought it was a teachable moment.
rampion
Thanks, you did a great job explaining this. I do need to use negation because the text between forms is not uniform; the example was oversimplified.
Seth Reno
A: 

For what it's worth, you might find that your code ends up a bit more readable if you expanded it out and used multiple, simpler regexes. For example (untested):

  transformed_lines = []

  text.each_line do |line|
    if line =~ /(\w+)\s+\((\d+)\)/   # header line: $1 = form id, $2 = digits in parens
      transformed_lines << [ $1, $2, "" ]
    else
      transformed_lines.last.last << line.strip
    end
  end

Better yet, consider creating a class or simple struct for storing the results so it's a little clearer what goes where:

  require 'ostruct'   # OpenStruct is in the standard library
  transformed_lines << OpenStruct.new(:thingy_one => $1, :thingy_two => $2, :fields => "")
  ...
  transformed_lines.last.fields << line.strip
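
A fuller, runnable sketch of this line-by-line approach. It swaps in Ruby's built-in Struct instead of OpenStruct and stores the fields as an array of lines rather than one concatenated string. The names FormEntry, HEADER, parse_forms, form_id and ed_date are hypothetical, and the text is the sample from the question.

```ruby
# Hypothetical record type: one entry per header line.
FormEntry = Struct.new(:form_id, :ed_date, :fields)

# Header lines look like "/* F004 (0309)00 */".
HEADER = /\/\*\s+(\w+)\s+\((\d{4})\)\d{2}\s+\*\//

def parse_forms(text)
  entries = []
  text.each_line do |line|
    stripped = line.strip
    next if stripped.empty?
    if (m = HEADER.match(stripped))
      # Start a new entry for each header.
      entries << FormEntry.new(m[1], m[2], [])
    elsif entries.any?
      # Any other line belongs to the most recent header.
      entries.last.fields << stripped
    end
  end
  entries
end

text = [
  "    /* F004 (0309)00 */  ",
  "    /* field 1 */  ",
  "    /* field 2 */  ",
  "    /* F004 (0409)00 */  ",
  "    /* field 1 */  ",
  "    /* field 2 */  "
].join("\n") + "\n"

forms = parse_forms(text)
forms.each { |f| puts "#{f.form_id} #{f.ed_date} #{f.fields.inspect}" }
```

Lines that appear before the first header are skipped rather than raising an error, which is a design choice of this sketch and may not suit every input.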
Brian Guthrie