For a real language, a lexer's the way to go - like Guss said. But if the full language is only as complicated as your example, you can use this quick hack:
irb> text = %{Children^10 Health "sanitation management"^5}
irb> text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/).map do |word,phrase,boost|
{ :keywords => (word || phrase).downcase, :boost => (boost.nil? ? nil : boost.to_i) }
end
#=> [{:boost=>10, :keywords=>"children"}, {:boost=>nil, :keywords=>"health"}, {:boost=>5, :keywords=>"sanitation management"}]
If you're trying to parse a regular language then this method will suffice - though it wouldn't take many more complications to make the language non-regular.
A quick breakdown of the regex:
\w+
matches any single-term keywords
(?:\\.|[^\\"])*
uses non-capturing parentheses ((?:...)
) to match the contents of an escaped double quoted string - either an escaped symbol (\n
, \"
, \\
, etc.) or any single character that's neither a backslash nor a closing double quote.
"((?:\\.|[^\\"])*)"
captures only the contents of a quoted keyword phrase.
(?:(\w+)|"((?:\\.|[^\\"])*)")
matches any keyword - single term or phrase, capturing single terms into $1
and phrase contents into $2
\d+
matches a number.
\^(\d+)
captures a number following a caret (^
). Since this is the third set of capturing parentheses, it will be captured into $3
.
(?:\^(\d+))?
captures a number following a caret if it's there, matches the empty string otherwise.
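To see the pieces side by side, here's a sketch of the same regex written with the /x flag, which lets it span several lines with comments. The variable names (pattern, m, escaped) are only for illustration. It also shows that the captured phrase keeps its backslashes; nothing here unescapes them.

```ruby
# Same regex as above, spelled out with the x flag so that
# whitespace and comments inside the pattern are ignored.
pattern = /
  (?:
    (\w+)                   # group 1: single-term keyword
    |
    "((?:\\.|[^\\"])*)"     # group 2: contents of a quoted phrase
  )
  (?:\^(\d+))?              # group 3: optional boost after a caret
/x

m = pattern.match('"sanitation management"^5')
p m[1]  # nil (the single-term alternative did not match)
p m[2]  # "sanitation management"
p m[3]  # "5"

# An escaped quote inside the phrase is consumed by the \\. branch,
# and the backslashes stay in the captured text.
escaped = pattern.match('"say \"hi\""^2')
puts escaped[2]  # say \"hi\"
```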
String#scan(regex)
matches the regex against the string as many times as possible, returning an array of "matches". If the regex contains capturing parens, a "match" is an array of items captured - so $1
becomes match[0]
, $2
becomes match[1]
, etc. Any capturing parenthesis that doesn't get matched against part of the string maps to a nil
entry in the resulting "match".
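If it helps, this is what the raw output of scan looks like for the example text before any mapping is done (matches is just a name picked for illustration):

```ruby
text = %{Children^10 Health "sanitation management"^5}
matches = text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/)

# One inner array per match, one slot per capturing group;
# groups that took no part in a match show up as nil.
matches.each { |match| p match }
# ["Children", nil, "10"]
# ["Health", nil, nil]
# [nil, "sanitation management", "5"]
```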
The #map
then takes these matches, relies on block-parameter destructuring to split each match array into separate variables (we could have done do |match| ; word,phrase,boost = *match
), and then creates your desired hashes. Exactly one of word
or phrase
will be nil
, since both can't be matched against the input, so (word || phrase)
will return the non-nil
one, and #downcase
will convert it to all lowercase. boost.to_i
will convert a string to an integer while (boost.nil? ? nil : boost.to_i)
will ensure that nil
boosts stay nil
.
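For comparison, here's a rough sketch of the longhand version with the destructuring written out explicitly. The names pattern and result are just for illustration. It also shows why the nil check on boost matters: nil.to_i is 0, so without the check a missing boost would turn into 0 instead of staying nil.

```ruby
text = %{Children^10 Health "sanitation management"^5}
pattern = /(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/

result = text.scan(pattern).map do |match|
  # Explicit form of the |word,phrase,boost| block parameters.
  word, phrase, boost = *match
  { :keywords => (word || phrase).downcase,
    :boost    => (boost.nil? ? nil : boost.to_i) }
end

p nil.to_i                      # 0, which is why the nil check is there
p result.map { |h| h[:boost] }  # [10, nil, 5]
```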