I'm trying to figure out how to better parse lines of text that have values that look like this:

line1  
'Line two' fudgy whale 'rolly polly'  
fudgy 'line three' whale  
fudgy whale 'line four'  
'line five' 'fish heads' 
line six  

I wish to use a single regular expression to display the desired output. I already know how to kludge it up to get the desired output but I want a single expression.

Desired output:

["line1"]
["Line two", "fudgy", "whale", "rolly polly"]
["fudgy", "line three", "whale"]
["fudgy", "whale", "line four"]
["line five", "fish heads"]
["line", "six"]

The line reading is already handled for me by Cucumber. Each line is read as one string value, and I want to parse out single words and any number of words contained inside single quotes. I know less than nothing about regular expressions, but I've cobbled together a regular expression using the regex "or" operator ("|") that got me close.

Taking that regex I first tried parsing each line using a string split:

text_line.split(/(\w+)|'(.*?)'/)

Which resulted in the following, less than acceptable, arrays:

["", "line1"]
["", "Line two", " ", "fudgy", " ", "whale", " ", "rolly polly"]
["", "fudgy", " ", "line three", " ", "whale"]
["", "fudgy", " ", "whale", " ", "line four"]
["", "line five", " ", "fish heads"]
["", "line", " ", "six"]

I next tried using scan instead of a split and I saw this:

text_line.scan(/(\w+)|'(.*?)'/)
[["line1", nil]]
[[nil, "Line two"], ["fudgy", nil], ["whale", nil], [nil, "rolly polly"]]
[["fudgy", nil], [nil, "line three"], ["whale", nil]]
[["fudgy", nil], ["whale", nil], [nil, "line four"]]
[[nil, "line five"], [nil, "fish heads"]]
[["line", nil], ["six", nil]]

So I could see that the regex "or" operator was producing a value for each capture group on either side of the "or", which made sense. Knowing that, I figured out I could use scan, flatten, and compact to clean it up, giving me the desired output:

text_line.scan(/(\w+)|'(.*?)'/).flatten.compact
["line1"]
["Line two", "fudgy", "whale", "rolly polly"]
["fudgy", "line three", "whale"]
["fudgy", "whale", "line four"]
["line five", "fish heads"]
["line", "six"]

But chaining scan, flatten, and compact looks incredibly ugly, and it seems like I'm just papering over my own bad regular expression. Instead of ham-handedly fixing the sloppy output from my poorly constructed regex, I think I should just write a better regular expression.

So, is it possible to use a single regular expression to parse the above lines and get the desired output? I may be way off on the regex to begin with but I'm thinking if I could just somehow group the or's so they only return one value per group that would probably be what I'm looking for.

Please feel free to suggest alternate solutions but I'm looking for elegant solutions done the Ruby way since I'm trying to teach myself how to use the language.

Thanks in advance for your time.

Edited to incorporate tinifni's better, more accurate regex.

+2  A: 

If you want an array of arrays of different sizes, you can do it in two steps: .split and .scan. In your case the regex passed to .scan has a capture group () on each side of the |, which is why you get the nil values (they are supposed to be useful, but not in your case). So you have to either use .flatten.compact or drop the groups and add a third step with .delete.

text.split("\n").map { |line| p line.scan(/'([^']+)'|(\w+)/).flatten.compact }
text.split("\n").map { |line| p line.scan(/'[^']+'|\w+/).map { |token| token.delete("'") } }
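
To make the difference between the two one-liners above concrete, here is a small sketch run against the sample lines from the question. The variable names (`text`, `pairs`, `whole`, `with_groups`, `without_groups`) are only illustrative assumptions, not part of the original code.

```ruby
# Sample input copied from the question; `text` is an assumed variable name.
text = <<~LINES
  line1
  'Line two' fudgy whale 'rolly polly'
  fudgy 'line three' whale
  fudgy whale 'line four'
  'line five' 'fish heads'
  line six
LINES

# With capture groups, scan returns one [group1, group2] pair per match,
# so the group that did not take part in the match comes back as nil.
pairs = "fudgy 'line three'".scan(/'([^']+)'|(\w+)/)
# pairs == [[nil, "fudgy"], ["line three", nil]]

# Without capture groups, scan returns each whole match as a plain string,
# quotes included, which is why the second variant deletes them afterwards.
whole = "fudgy 'line three'".scan(/'[^']+'|\w+/)
# whole == ["fudgy", "'line three'"]

# Variant 1: keep the groups, then flatten the pairs and drop the nils.
with_groups = text.split("\n").map do |line|
  line.scan(/'([^']+)'|(\w+)/).flatten.compact
end

# Variant 2: no groups, then strip the quote characters from each token.
without_groups = text.split("\n").map do |line|
  line.scan(/'[^']+'|\w+/).map { |token| token.delete("'") }
end

with_groups.each { |tokens| p tokens }
```

Both variants should give the same arrays for this input. Note that the second one removes every `'` from a token, which is only safe because a quoted token here never contains a quote of its own.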
Nakilon
Thanks for your reply. You're covering ground I'm already using or have rejected. My main focus is on doing the most "correct" Ruby which avoids the sloppy programming I'm using now. I am however very appreciative of your time and effort.
Mike Bethany
+2  A: 
tinifni
Thank you for the effort. This is actually less elegant than the solution I've already come up with; the focus of my question is how to write a better regular expression. Thanks for the try though.
Mike Bethany
You get the "answered." You not only fixed a bug in my regex, you made the code cleaner. And yes, more elegant. Now if I can manage to go a few months without saying that word again...
Mike Bethany
Well, I am glad it helped. I still wish I could find the solution that does not require a flatten and compact. Thanks for helping me get my first few points here. Happy coding!
tinifni
Did tinifni give a different answer than mine?
Nakilon
@Nakilon. You had an extra step in there, the map, plus tinifni's regex is cleaner. But I voted up your answer too. Thanks for your help.
Mike Bethany
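
On the wish above for something that does not need flatten and compact: one possible route, not suggested by anyone in this thread and offered only as a sketch, is to skip the regex and use `Shellwords.split` from Ruby's standard library. It splits on whitespace while keeping quoted phrases together. The assumption is that the input follows shell-style quoting: it also interprets double quotes and backslashes, and it raises `ArgumentError` on an unmatched quote, so it is not a drop-in replacement for every input.

```ruby
require "shellwords"

# Sample lines copied from the question; `lines` is an assumed variable name.
lines = [
  "line1",
  "'Line two' fudgy whale 'rolly polly'",
  "fudgy 'line three' whale",
  "fudgy whale 'line four'",
  "'line five' 'fish heads'",
  "line six"
]

# Shellwords.split tokenizes like a POSIX shell: whitespace separates words,
# and text inside single quotes stays together with the quotes removed.
parsed = lines.map { |line| Shellwords.split(line) }

parsed.each { |tokens| p tokens }
```

The trade-off is that the parsing rules are the shell's rather than your own, so a hand-written regex remains the better fit if the lines can contain stray apostrophes.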