I wonder if anyone can provide me with the regular expressions needed to parse a string like:
'foo bar "multiple word tag"'
into an array of tags like:
["foo","bar","multiple word tag"]
Thanks
A regex will almost certainly not be the solution you are looking for here. Regexes are useful for pulling a matched piece of data out of a larger string. For example, if I wanted to get just the user name from an email address, I could use the following regex to grab the data:
"^(?<username>[\w\d]+)@.*$"
The user name would be captured in the named group "username".
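In Python, for instance, that capture could look like the sketch below (Python spells named groups as (?P<name>...), so the pattern is a slight adaptation of the one above, and the address is just an example):
import re

# Slight adaptation: Python names groups with (?P<name>...) rather than (?<name>...).
m = re.match(r"^(?P<username>[\w\d]+)@.*$", "someuser@example.com")
if m:
    print(m.group("username"))  # someuser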
In your case, you are not trying to get a subset of the input string; you are trying to match the elements of the entire string. At the end of the day, the regex will just say "yes, it matches" or "no, it doesn't". To get the contents out, you need to actually parse the string.
You could implement a scanner to do this. For instance, in Python it'd look something like this:
import re

scanner = re.Scanner([
    (r"[a-zA-Z_]\w*", lambda s, t: t),      # regular tag
    (r"\".*?\"", lambda s, t: t[1:-1]),     # multi-word tag, quotes stripped
    (r"\s+", None),                         # whitespace not inside a multi-word tag
])
tags, _ = scanner.scan('foo bar "multiple word tag"')
print(tags)
# ['foo', 'bar', 'multiple word tag']
This is called lexical analysis.
First of all, I'd suggest doing this with a split() method/function rather than regular expressions. Most languages have something like this which you can call to split a string into words (separated by whitespace), and you can usually specify an upper bound on how many parts you want it split into. So generically,
split('foo bar "multiple word tag"', ' ', 3)
where the 3 indicates no more than 3 parts, would work for your example. You could then use a trim() or strip() method/function (or write one) to remove any leading and trailing quotes.
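For example, a rough Python sketch of that approach (it assumes, like the example, that the quoted multi-word tag comes last):
# Split into at most 3 parts, then strip surrounding quotes from each part.
s = 'foo bar "multiple word tag"'
parts = s.split(' ', 2)               # maxsplit=2 gives at most 3 parts
tags = [p.strip('"') for p in parts]  # drop leading/trailing quotes
print(tags)
# ['foo', 'bar', 'multiple word tag']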
If you're intent on doing it with regular expressions, perhaps because each line could have a variable number of tags, to some extent it depends on what exactly you're using to do the parsing, since different regex engines sometimes have different ways of representing the same things. And I don't think it can be done with just a plain old regular expression by itself; you'll need some code to go along with it. For example, here's a pseudocode solution using a Perl-compatible regular expression (or something like it, anyway):
pos = 0;
while pos < length(string):
    # match(regular expression, string to search, starting position for the search)
    m = match(/\s*(".+?"|\S+)?\s*/, string, pos);
    tag = m.group(1).strip('"');
    # process the tag
    pos = m.end();  # advance past the match so the loop terminates
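If you happen to be in Python, a rough translation of that loop might look like this (pattern.match(string, pos) starts the match at the given offset; the names are just illustrative):
import re

pattern = re.compile(r'\s*(".+?"|\S+)\s*')

def extract_tags(string):
    tags = []
    pos = 0
    while pos < len(string):
        m = pattern.match(string, pos)
        if not m:
            break
        tags.append(m.group(1).strip('"'))  # drop surrounding quotes if any
        pos = m.end()                       # advance past this tag
    return tags

print(extract_tags('foo bar "multiple word tag"'))
# ['foo', 'bar', 'multiple word tag']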
For what it's worth, I would probably do this with a DFA (deterministic finite automaton), which goes through the string character by character, appending each one to a buffer and flushing the buffer when it reaches the end of a tag (either because of a space or a closing quote mark). Maybe it's just me, but I feel like this is a pretty simple parsing task and it would be easier to understand (to my mind) in terms of DFA states.
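As a sketch of that idea (assuming quotes are only used to delimit multi-word tags), a character-by-character scanner in Python that buffers characters and flushes on a space outside quotes or on a closing quote:
def scan_tags(s):
    tags = []
    buf = []
    in_quotes = False
    for ch in s:
        if ch == '"':
            if in_quotes and buf:   # closing quote: flush the multi-word tag
                tags.append(''.join(buf))
                buf = []
            in_quotes = not in_quotes
        elif ch == ' ' and not in_quotes:
            if buf:                 # space ends an unquoted tag
                tags.append(''.join(buf))
                buf = []
        else:
            buf.append(ch)
    if buf:                         # trailing tag with no delimiter after it
        tags.append(''.join(buf))
    return tags

print(scan_tags('foo bar "multiple word tag"'))
# ['foo', 'bar', 'multiple word tag']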
General regex that will work with any match->array function:
(?<=")[^"]+|\w+
(If more than just alphanumerics and quotes are allowed, using \S+ instead of \w+ might make sense.)
Ruby example:
myarray = mystring.scan(/(?<=\")[^\"]+|\w+/)
(untested)
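For instance, Python's re.findall works as such a match->array function:
import re

# Same pattern, applied with re.findall to collect every match into a list.
print(re.findall(r'(?<=")[^"]+|\w+', 'foo bar "multiple word tag"'))
# ['foo', 'bar', 'multiple word tag']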
Here we go (Perl style):
^(?:"([^"]*?)"|(\S+?)|\s*?)*$
Explanation:
^            // from the beginning
(?:          // non-capturing group of three alternatives
  "([^"]*?)" //   capture a quoted "tag"
  |
  (\S+?)     //   capture a plain tag
  |
  \s*?       //   ignore whitespace
)*
$            // until the end of the line
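Note that when the whole string is matched in one go, most engines keep only the last value of a repeated capture group, so to collect every tag you'd typically apply the alternation match by match instead. A Python sketch of that (with \S+? loosened to \S+ so a bare word is captured whole):
import re

# Iterate over matches of the alternation rather than capturing inside
# one anchored, repeated group.
pattern = re.compile(r'"([^"]*?)"|(\S+)')

def parse_tags(string):
    tags = []
    for m in pattern.finditer(string):
        quoted, plain = m.groups()
        tags.append(quoted if quoted is not None else plain)
    return tags

print(parse_tags('foo bar "multiple word tag"'))
# ['foo', 'bar', 'multiple word tag']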
In Ruby
scan(/\"([\w ]+)\"|(\w+)/).flatten.compact
E.g.
"foo bar \"multiple words\" party_like_1999".scan(/\"([\w ]+)\"|(\w+)/).flatten.compact
=> ["foo", "bar", "multiple words", "party_like_1999"]