I have a method which scans plain text (specifically in the QIF format) looking for dates which occur after a 'D' on a new line:

dates = "D2009-11-12\nPApple Store\nMSnow Leopard\nD2009-11-13\nPApple Store\nMiMac".scan(/^\s*D"?(.+?)[\r\n?|\n]/m)
# => [["2009-11-12"], ["2009-11-13"]]

"D2009-11-12\r\nPApple Store\r\nMSnow Leopard\r\nD2009-11-13\r\nPApple Store\r\nMiMac".scan(/^\s*D"?(.+?)[\r\n?|\n]/m)
# => [["2009-11-12"], ["2009-11-13"]]

This works well across a variety of formats, but I've just come across an issue with files generated by Quicken on the Mac, which saves them in Mac OS Classic format. That is to say, the lines are delimited using carriage returns, not newlines (i.e. '\r', not '\n' or '\n\r').

"D2009-11-12\rPApple Store\rMSnow Leopard\rD2009-11-13\rPApple Store\rMiMac".scan(/^\s*D"?(.+?)[\r\n?|\n]/m)
# => [["2009-11-12"]]

The problem appears to be that Ruby's multi-line regex code doesn't consider '\r' to be a new line delimiter (which of course it isn't).

What is the best way to support the original parsing yet also handle these Mac OS Classic files?

Should I replace all occurrences of '\r' with '\n\r'? If so, how should I go about it, given that a call to string.gsub(/\r/, '\n\r') would also rewrite line endings that already contain a '\n'? I would like to call string.gsub(/[^\n]\r/, '$1\n\r'), but this isn't supported by the gsub method.

A: 

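This should cover all line endings (the \A alternative lets a date on the very first line match too, and the lookahead leaves the line break in place for the next match):

/(?:\A|[\r\n])\s*D"?(.+?)(?=[\r\n]|\z)/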

Or, forget the newlines and match what you're looking for:

/D"?(\d{4}(?:-\d\d){2})/m

Please note that [\r\n?|\n] matches | and ? as literals. Also, your regex captures all lines that start with D.
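
As a minimal sketch of that point (the variable names are made up for illustration, and `match?` assumes Ruby 2.4 or later), the original character class can be compared with a group that uses real alternation:

```ruby
# Inside [...] the characters ? and | lose their special meaning,
# so this class matches any one of: \r, \n, ?, |
original_class = /[\r\n?|\n]/

pipe_matches     = "a|b".match?(original_class)    # true: | is matched literally
question_matches = "why?".match?(original_class)   # true: ? is matched literally
plain_matches    = "plain".match?(original_class)  # false: none of the four characters

# A group with alternation expresses "CRLF, LF, or a lone CR" instead.
line_break = /(?:\r?\n|\r)/
pieces = "a\r\nb\rc\nd".split(line_break)          # ["a", "b", "c", "d"]
```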

Kobi
A: 

Assuming all your dates are in YYYY-MM-DD format, here's a regex that should work for you:

string.scan(/(?:^|\r?\n|\r)D(\d{4}-\d{2}-\d{2})(?:\r?\n|\r|$)/)

Testing out in irb seems to cover all your cases:

irb> str1 = "D2009-11-12\nPApple Store\nMSnow Leopard\nD2009-11-13\nPApple Store\nMiMac"
#=> "D2009-11-12\nPApple Store\nMSnow Leopard\nD2009-11-13\nPApple Store\nMiMac"
irb> str2 = "D2009-11-12\r\nPApple Store\r\nMSnow Leopard\r\nD2009-11-13\r\nPApple Store\r\nMiMac"
#=> "D2009-11-12\r\nPApple Store\r\nMSnow Leopard\r\nD2009-11-13\r\nPApple Store\r\nMiMac"
irb> str3 = "D2009-11-12\rPApple Store\rMSnow Leopard\rD2009-11-13\rPApple Store\rMiMac"
#=> "D2009-11-12\rPApple Store\rMSnow Leopard\rD2009-11-13\rPApple Store\rMiMac"
irb> str1.scan(/(?:^|\r?\n|\r)D(\d{4}-\d{2}-\d{2})(?:\r?\n|\r|$)/)
#=> [["2009-11-12"], ["2009-11-13"]]
irb> str2.scan(/(?:^|\r?\n|\r)D(\d{4}-\d{2}-\d{2})(?:\r?\n|\r|$)/)
#=> [["2009-11-12"], ["2009-11-13"]]
irb> str3.scan(/(?:^|\r?\n|\r)D(\d{4}-\d{2}-\d{2})(?:\r?\n|\r|$)/)
#=> [["2009-11-12"], ["2009-11-13"]]

The three standard line breaks are \n, \r, and \r\n (not \n\r). So handling all three of those is done by the regex \r?\n|\r. Note that the order of the alternatives is important here: alternation tries the alternatives from left to right, so \r|\r?\n would match the \r of a \r\n first and then match the \n as a second, separate line break.
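
A quick way to see the ordering effect is to scan a single Windows line break with both variants (a small sketch; the variable names are just for illustration):

```ruby
crlf = "\r\n"

# \r?\n is tried first, so the pair is consumed as one line break.
one_break = crlf.scan(/\r?\n|\r/)    # ["\r\n"]

# \r is tried first, so the pair is split into two line breaks.
two_breaks = crlf.scan(/\r|\r?\n/)   # ["\r", "\n"]
```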

If you want to use gsub to convert all your line endings to Unix style, note that \1 is the code for a back reference, not $1. But you don't need back references to convert the line endings.

string.gsub(/\r\n|\r/, "\n")

Going back into irb:

irb> str1.gsub(/\r\n|\r/, "\n")
#=> "D2009-11-12\nPApple Store\nMSnow Leopard\nD2009-11-13\nPApple Store\nMiMac"
irb> str2.gsub(/\r\n|\r/, "\n")
#=> "D2009-11-12\nPApple Store\nMSnow Leopard\nD2009-11-13\nPApple Store\nMiMac"
irb> str3.gsub(/\r\n|\r/, "\n")
#=> "D2009-11-12\nPApple Store\nMSnow Leopard\nD2009-11-13\nPApple Store\nMiMac"
rampion
Thanks, this is great. My actual use case is somewhat more complicated than my example in that the dates can take a variety of formats (dd/mm/yyyy, dd/mm/yy, dd MMM yyyy), none of which are known at parse-time. Also, could you possibly explain the purpose of '?:'?
Olly
@Olly So `(?:...)` is a non-capturing group in regex - it allows you to group things without capturing them like regular parens `(...)` do.
rampion
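
To illustrate the difference, here is a small sketch (the sample text and variable names are made up) that runs the same scan with a capturing and a non-capturing group in front of the date:

```ruby
text = "D2009-11-12\rPApple Store\rD2009-11-13"

# Capturing group: scan returns the line-start group as well as the date.
with_capture = text.scan(/(^|\r)D(\d{4}-\d{2}-\d{2})/)
# [["", "2009-11-12"], ["\r", "2009-11-13"]]

# Non-capturing group: only the date is returned.
without_capture = text.scan(/(?:^|\r)D(\d{4}-\d{2}-\d{2})/)
# [["2009-11-12"], ["2009-11-13"]]
```
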
@Olly If you want to parse a variety of formats, can I suggest using a library rather than implementing it yourself? If all the lines beginning with `D` are supposed to be dates, then you could do `D([^\r\n]*)` rather than `D(\d{4}-\d{2}-\d{2})` (not `D(.*)`, because `.` also matches `\r`) and pass the date strings off to `Date.parse` for verification/parsing (it will raise an `ArgumentError` on an invalid date string). `require 'date'; str1.scan(/(?:^|\r?\n|\r)D([^\r\n]*)(?:\r?\n|\r|$)/).map { |(d)| Date.parse(d) rescue nil } #=> [#<Date: 2009-11-12 (4910295/2,0,2299161)>, #<Date: 2009-11-13 (4910297/2,0,2299161)>]`
rampion
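
If `Date.parse` turns out to be too lenient for the formats mentioned in the comments, one alternative is to try a fixed list of formats with `Date.strptime`. The sketch below is only an assumption about how that could look: `QIF_DATE_FORMATS` and `parse_qif_date` are hypothetical names, the day-first ordering is taken from the formats listed above, and each format is paired with an anchored regex so that a two-digit year is never mistaken for a four-digit one.

```ruby
require 'date'

# Hypothetical table: an anchored shape check paired with a strptime format.
QIF_DATE_FORMATS = [
  [/\A\d{4}-\d{2}-\d{2}\z/,         "%Y-%m-%d"],  # 2009-11-12
  [/\A\d{1,2}\/\d{1,2}\/\d{4}\z/,   "%d/%m/%Y"],  # 12/11/2009
  [/\A\d{1,2}\/\d{1,2}\/\d{2}\z/,   "%d/%m/%y"],  # 12/11/09
  [/\A\d{1,2} [A-Za-z]{3} \d{4}\z/, "%d %b %Y"]   # 12 Nov 2009
].freeze

# Hypothetical helper: returns a Date, or nil when no format fits
# or the date itself is invalid (for example 31/02/2009).
def parse_qif_date(raw)
  text = raw.strip
  _shape, format = QIF_DATE_FORMATS.find { |shape, _| text.match?(shape) }
  return nil unless format

  Date.strptime(text, format)
rescue ArgumentError
  nil
end

parsed = ["2009-11-12", "12/11/2009", "12/11/09", "12 Nov 2009"].map do |s|
  parse_qif_date(s)
end
```

With these inputs every element of `parsed` should be the same date, 12 November 2009, while something like `parse_qif_date("Apple Store")` returns `nil`.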