I'm trying to find a better way to parse lines of text that look like this:
line1
'Line two' fudgy whale 'rolly polly'
fudgy 'line three' whale
fudgy whale 'line four'
'line five' 'fish heads'
line six
I wish to use a single regular expression to display the desired output. I already know how to kludge it up to get the desired output but I want a single expression.
Desired output:
["line1"]
["Line two", "fudgy", "whale", "rolly polly"]
["fudgy", "line three", "whale"]
["fudgy", "whale", "line four"]
["line five", "fish heads"]
["line", "six"]
The line reading is already handled for me via Cucumber. Each line is read as one string value, and I want to parse out single words and any number of words contained inside single quotes. I know less than nothing about regular expressions, but I've cobbled together a regular expression using the regex "or" operator ("|") that got me close.
Taking that regex I first tried parsing each line using a string split:
text_line.split(/(\w+)|'(.*?)'/)
Which resulted in the following, less than acceptable, arrays:
["", "line1"]
["", "Line two", " ", "fudgy", " ", "whale", " ", "rolly polly"]
["", "fudgy", " ", "line three", " ", "whale"]
["", "fudgy", " ", "whale", " ", "line four"]
["", "line five", " ", "fish heads"]
["", "line", " ", "six"]
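For what it's worth, the stray "" and " " entries appear to come from how String#split works: each regex match is treated as a delimiter, the captured groups are added to the result, and whatever sits between the matches (here, the spaces) is kept as a field. A small sketch of that behavior, where `pattern`, `line`, `pieces`, and `cleaned` are just illustrative names:

```ruby
pattern = /(\w+)|'(.*?)'/
line = "fudgy 'line three' whale"

# split treats every match as a delimiter, so the "fields" are the text
# between matches: the empty string before "fudgy" and the single spaces.
# The text captured by the groups is inserted between those fields.
pieces = line.split(pattern)
p pieces

# Getting rid of the blank fields would be yet another cleanup step.
cleaned = pieces.reject { |piece| piece.strip.empty? }
p cleaned
```

So split seems to be the wrong tool here, since the pattern describes the tokens I want rather than the separators between them.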
I next tried using scan instead of a split and I saw this:
text_line.scan(/(\w+)|'(.*?)'/)
[["line1", nil]]
[[nil, "Line two"], ["fudgy", nil], ["whale", nil], [nil, "rolly polly"]]
[["fudgy", nil], [nil, "line three"], ["whale", nil]]
[["fudgy", nil], ["whale", nil], [nil, "line four"]]
[[nil, "line five"], [nil, "fish heads"]]
[["line", nil], ["six", nil]]
So I could see that the regex "or" operator was producing a value for each capture group on either side of the "or", with nil for the side that didn't match, which made sense. Knowing that, I figured out I could use scan, flatten, and compact to clean it up, giving me the desired output:
text_line.scan(/(\w+)|'(.*?)'/).flatten.compact
["line1"]
["Line two", "fudgy", "whale", "rolly polly"]
["fudgy", "line three", "whale"]
["fudgy", "whale", "line four"]
["line five", "fish heads"]
["line", "six"]
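To make the chain easier to follow, here is a minimal sketch of each stage on one of the sample lines. The variable names (`pattern`, `line`, `matches`, `flat`, `tokens`) are only for illustration:

```ruby
pattern = /(\w+)|'(.*?)'/
line = "fudgy 'line three' whale"

# scan returns one sub-array per match, with one slot per capture group;
# the group on the side of the "|" that did not match is nil.
matches = line.scan(pattern)
p matches  # [["fudgy", nil], [nil, "line three"], ["whale", nil]]

# flatten merges the sub-arrays into a single list, nils included...
flat = matches.flatten
p flat     # ["fudgy", nil, nil, "line three", "whale", nil]

# ...and compact removes the nil placeholders.
tokens = flat.compact
p tokens   # ["fudgy", "line three", "whale"]
```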
But chaining scan, flatten, and compact looks incredibly ugly, and it seems like I'm just papering over my own bad regular expression. Instead of ham-handedly fixing the sloppy output from my poorly constructed regex, I should just write a better regular expression.
So, is it possible to use a single regular expression to parse the above lines and get the desired output? I may be way off on the regex to begin with, but I'm thinking that if I could somehow group the alternatives so they only return one value per match, that would probably be what I'm looking for.
Please feel free to suggest alternate solutions, but I'm looking for elegant solutions done the Ruby way, since I'm trying to teach myself how to use the language.
Thanks in advance for your time.
edited to incorporate tininfi's better, more accurate regex