ansaurus

Question

How can I use String#scan to scan multiple lines separated only by a carriage return, not a newline?

Answer 1

A:

This should cover all options:

/[\r\n]+\s*D"?(.+?)[\r\n]+/m

Or, forget the newlines and match what you're looking for:

/D"?(\d{4}(?:-\d\d){2})/m

Please note that [\r\n?|\n] is a character class, so it matches | and ? as literal characters rather than treating them as operators. Also, your regex captures all lines that start with D.
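To see the character-class point in action, here is a minimal sketch. The variable names are made up for illustration, and the grouped pattern is just one possible rewrite, not necessarily what the asker intended.

```ruby
# Inside [...], ? and | are ordinary characters, not operators.
# (Ruby may warn that \n appears twice in the class; the pattern is kept as quoted.)
char_class = /[\r\n?|\n]/

pipe_match     = "a|b".scan(char_class)    # matches the literal "|"
question_match = "why?".scan(char_class)   # matches the literal "?"

# Alternation and optional characters work as operators inside a group.
line_break = /(?:\r\n?|\n)/

breaks      = "a\r\nb\rc\nd".scan(line_break)  # one match per line break
no_literals = "a|b?".scan(line_break)          # no line breaks, so no matches
```

The grouped form treats \r\n as a single line break, while the character class matches \r and \n one character at a time.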

Kobi 2010-02-15 12:15:34

Answer 2

+1 A:

Assuming all your dates are in YYYY-MM-DD format, here's a regex that should work for you:

string.scan(/(?:^|\r?\n|\r)D(\d{4}-\d{2}-\d{2})(?:\r?\n|\r|$)/)

Testing out in irb seems to cover all your cases:

irb> str1 = "D2009-11-12\nPApple Store\nMSnow Leopard\nD2009-11-13\nPApple Store\nMiMac"
#=> "D2009-11-12\nPApple Store\nMSnow Leopard\nD2009-11-13\nPApple Store\nMiMac"
irb> str2 = "D2009-11-12\r\nPApple Store\r\nMSnow Leopard\r\nD2009-11-13\r\nPApple Store\r\nMiMac"
#=> "D2009-11-12\r\nPApple Store\r\nMSnow Leopard\r\nD2009-11-13\r\nPApple Store\r\nMiMac"
irb> str3 = "D2009-11-12\rPApple Store\rMSnow Leopard\rD2009-11-13\rPApple Store\rMiMac"
#=> "D2009-11-12\rPApple Store\rMSnow Leopard\rD2009-11-13\rPApple Store\rMiMac"
irb> str1.scan(/(?:^|\r?\n|\r)D(\d{4}-\d{2}-\d{2})(?:\r?\n|\r|$)/)
#=> [["2009-11-12"], ["2009-11-13"]]
irb> str2.scan(/(?:^|\r?\n|\r)D(\d{4}-\d{2}-\d{2})(?:\r?\n|\r|$)/)
#=> [["2009-11-12"], ["2009-11-13"]]
irb> str3.scan(/(?:^|\r?\n|\r)D(\d{4}-\d{2}-\d{2})(?:\r?\n|\r|$)/)
#=> [["2009-11-12"], ["2009-11-13"]]

The three standard line breaks are \n, \r, and \r\n (not \n\r), and the regex \r?\n|\r handles all three. Note that the order of the alternatives is important here: alternation tries the leftmost alternative first, so \r|\r?\n would match the \r of a \r\n on its own and then match the \n as a second, separate line break.
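A small sketch of why the order of the alternatives matters, using a made-up string that mixes all three line-break styles:

```ruby
text = "one\r\ntwo\rthree\nfour"

# \r?\n is tried first, so a \r\n pair is consumed as one line break.
good = text.scan(/\r?\n|\r/)

# \r is tried first, so the \r of \r\n matches alone and \n matches separately.
bad = text.scan(/\r|\r?\n/)

# The difference shows up as a spurious empty line when splitting.
good_lines = text.split(/\r?\n|\r/)
bad_lines  = text.split(/\r|\r?\n/)
```

Here good holds three line breaks and bad holds four, and bad_lines picks up an empty string between "one" and "two".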

If you want to use gsub to convert all your line endings to Unix style, note that in a replacement string a back reference is written \1, not $1. But you don't need back references to convert the line endings.

string.gsub(/\r\n|\r/, "\n")

Going back into irb:

irb> str1.gsub(/\r\n|\r/, "\n")
#=> "D2009-11-12\nPApple Store\nMSnow Leopard\nD2009-11-13\nPApple Store\nMiMac"
irb> str2.gsub(/\r\n|\r/, "\n")
#=> "D2009-11-12\nPApple Store\nMSnow Leopard\nD2009-11-13\nPApple Store\nMiMac"
irb> str3.gsub(/\r\n|\r/, "\n")
#=> "D2009-11-12\nPApple Store\nMSnow Leopard\nD2009-11-13\nPApple Store\nMiMac"
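On the back-reference remark above, a minimal sketch of the difference between \1 and $1 in gsub. The date-reordering pattern is only an illustration and is not part of the original problem.

```ruby
record = "D2009-11-12"
date_pattern = /(\d{4})-(\d{2})-(\d{2})/

# In a replacement string, \1, \2, \3 refer to the capture groups.
# Single quotes keep the backslashes intact.
swapped = record.gsub(date_pattern, '\3/\2/\1')

# "$1" in a replacement string is not a back reference; it is inserted literally.
literal = record.gsub(date_pattern, '$1')

# The global $1, $2, $3 are available inside the block form of gsub.
block_form = record.gsub(date_pattern) { "#{$3}/#{$2}/#{$1}" }
```

The string form and the block form give the same result here; the block form is handy when the replacement needs more logic than simple substitution.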

rampion 2010-02-15 12:40:13

Thanks, this is great. My actual use case is somewhat more complicated than my example in that the dates can take a variety of formats (dd/mm/yyyy, dd/mm/yy, dd MMM yyyy), none of which are known at parse-time. Also, could you possibly explain the purpose of '?:'?

Olly 2010-02-15 13:10:46

@Olly So `(?:...)` is a non-capturing group in regex - it lets you group things without capturing them the way regular parens `(...)` do.
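The effect on String#scan can be sketched like this, with a shortened sample string made up for illustration:

```ruby
line = "D2009-11-12\nPApple Store\n"

# A capturing group adds an extra element to every result row.
capturing = line.scan(/(^|\n)D(\d{4}-\d{2}-\d{2})/)

# A non-capturing group still groups the alternatives but adds nothing to the result.
non_capturing = line.scan(/(?:^|\n)D(\d{4}-\d{2}-\d{2})/)

# With no capturing groups at all, scan returns the whole matches as plain strings.
no_groups = line.scan(/D\d{4}-\d{2}-\d{2}/)
```

With the capturing version each row carries the matched line start (an empty string here) alongside the date, which is usually just noise.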

rampion 2010-02-15 15:35:16

@Olly If you want to parse a variety of formats, can I suggest using a library rather than implementing it yourself? If all the lines beginning with `D` are supposed to be dates, then you could do `D(.*)` rather than `D(\d{4}-\d{2}-\d{2})` and pass the date strings off to `Date.parse` for verification/parsing (it will raise an `ArgumentError` on an invalid date string). `require 'date'; str1.scan(/(?:^|\r?\n|\r)D(.*)(?:\r?\n|\r|$)/).map { |(d)| Date.parse(d) rescue nil } #=> [#<Date: 2009-11-12 (4910295/2,0,2299161)>, #<Date: 2009-11-13 (4910297/2,0,2299161)>]`
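A minimal sketch of that idea for input that uses bare \r line breaks. The helper names parse_or_nil and dates_from are hypothetical, and the pattern swaps `.*` for `[^\r\n]+` because `.` in Ruby matches \r and would otherwise run past a bare carriage return. Date.parse guesses the format, so ambiguous inputs such as dd/mm/yy may need Date.strptime with an explicit format instead.

```ruby
require 'date'

# Hypothetical helper: returns a Date, or nil when the text is not a recognizable date.
def parse_or_nil(raw)
  Date.parse(raw.strip)
rescue ArgumentError
  nil
end

# Hypothetical helper: collects the text after each line-initial "D".
# [^\r\n]+ stops at either kind of line break, so \n, \r\n and \r input all work.
def dates_from(text)
  text.scan(/(?:^|\r?\n|\r)D([^\r\n]+)/).map { |(raw)| parse_or_nil(raw) }
end

# Made-up sample with \r-only line breaks and one invalid date line.
mac_style = "D2009-11-12\rPApple Store\rD13 Nov 2009\rPApple Store\rD???"
dates = dates_from(mac_style)
```

Lines that do not parse come back as nil, so they can be dropped with compact or reported, depending on how strict the import should be.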

rampion 2010-02-15 15:42:44
