Can anyone provide a regular expression to parse a string like:

'foo bar "multiple word tag"'

into an array of tags like:

["foo","bar","multiple word tag"]

Thanks

A: 

A regex will almost certainly not be the solution you are looking for here. Regexes are useful for extracting a matched piece of input from a larger string. For example, if I wanted just the user name from an email address, I could use the following regex to grab it:

"^(?<username>[\w\d]+)@.*$"

The name would then be available in the named group "username".
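
As a rough illustration of the named-group idea in Python: the pattern above uses the `(?<name>...)` spelling, whereas Python's `re` module writes named groups as `(?P<name>...)`. The helper name `username_of` and the sample address are made up for this sketch.

```python
import re

# Python spells named groups (?P<name>...); \w already covers digits,
# so the \d in the original character class is not needed.
USERNAME = re.compile(r"^(?P<username>\w+)@.*$")

def username_of(address):
    """Return the part before '@', or None if the address does not match."""
    m = USERNAME.match(address)
    return m.group("username") if m else None

print(username_of("jared@example.com"))
```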

In your case, you are not trying to get a subset of the input string. You are trying to match the elements of the entire string. At the end of the day, the regex will just say "yes it matches" or "no it doesn't". In order to get out the contents you will need to actually parse out the string.

JaredPar
+2  A: 

You could implement a scanner to do this. For instance, in Python it'd look something like this:

import re
scanner = re.Scanner([
    (r"[a-zA-Z_]\w*", lambda s,t:t),       # regular tag
    (r"\".*?\"",      lambda s,t:t[1:-1]), # multi-word-tag
    (r"\s+",          None),               # whitespace not in multi-word-tag
    ])
tags, _ = scanner.scan('foo bar "multiple word tag"')
print(tags)
# ['foo', 'bar', 'multiple word tag']

This is called lexical analysis.

Evan Fosmark
That looks really nice. Any idea if this is possible in Ruby?
There probably is. Even if Ruby doesn't have a scanner class available like Python does, you can write your own as long as you have regex matching capabilities. Unfortunately, I don't know Ruby.
Evan Fosmark
A: 

First of all, I'd suggest doing this with a split() method/function rather than regular expressions. Most languages have something like this which you can call to split a string into words (separated by whitespace), and you can usually specify an upper bound on how many parts you want it split into. So generically,

split('foo bar "multiple word tag"', ' ', 3)

where the 3 indicates no more than 3 parts, would work for your example. You could use a trim() or strip() method/function (or write one) to remove any leading and trailing quotes.
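
In Python, for instance, the split-then-strip idea might look like the sketch below. Note that Python's second argument counts splits rather than parts, so 2 splits yield at most 3 parts. This only works when the number of tags is known in advance and only the last tag contains spaces.

```python
line = 'foo bar "multiple word tag"'

# maxsplit=2 means "split at most twice", which gives at most 3 parts.
parts = line.split(" ", 2)

# Remove any leading and trailing double quotes from each part.
tags = [part.strip('"') for part in parts]
print(tags)
```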

If you're intent on doing it with regular expressions, perhaps because each line could have a variable number of tags, to some extent it depends on what exactly you're using to do the parsing, since different regex engines sometimes have different ways of representing the same things. And I don't think it can be done with just a plain old regular expression by itself; you'll need some code to go along with it. For example, here's a (pseudo-?)pseudocode solution using a Perl-compatible regular expression (or something like it, anyway):

pos = 0;
while pos < length(string):
    # match(regular expression, string to search, starting position for the search)
    m = match(/\s*(".+?"|\S+)\s*/, string, pos);
    if no match: break;    # only whitespace is left
    tag = m.group(1).strip('"');
    # process the tag
    pos = m.end();         # resume the search right after this match
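
For a concrete version of that loop, here is a minimal Python sketch. The function name `parse_tags` is invented for illustration, and the key detail is that the position has to advance past each match or the loop never ends.

```python
import re

# Optional whitespace, then either a quoted run or a run of non-space characters.
TOKEN = re.compile(r'\s*(".+?"|\S+)')

def parse_tags(text):
    tags = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)   # try to match starting at pos
        if m is None:                # only trailing whitespace is left
            break
        tags.append(m.group(1).strip('"'))
        pos = m.end()                # continue after the token just consumed
    return tags

print(parse_tags('foo bar "multiple word tag"'))
```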

For what it's worth, I would probably do this with a DFA (deterministic finite automaton), which goes through the string character by character, appending each one to a buffer and flushing the buffer when it reaches the end of a tag (either because of a space or a closing quote mark). Maybe it's just me, but I feel like this is a pretty simple parsing task, and it would be easier to understand (to my mind) in terms of DFA states.
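
A small sketch of that state-machine approach in Python, assuming just two states (inside or outside a quoted tag) and no escaping of quotes. The function name `tags_by_state_machine` is hypothetical.

```python
def tags_by_state_machine(text):
    """Split text into tags; double quotes group several words into one tag."""
    tags = []
    buffer = []
    in_quotes = False               # the current state

    def flush():
        # Emit the buffered characters as one tag, if there are any.
        if buffer:
            tags.append("".join(buffer))
            buffer.clear()

    for ch in text:
        if in_quotes:
            if ch == '"':           # closing quote ends the quoted tag
                flush()
                in_quotes = False
            else:
                buffer.append(ch)
        elif ch == '"':             # opening quote starts a multi-word tag
            flush()
            in_quotes = True
        elif ch.isspace():          # whitespace ends a plain tag
            flush()
        else:
            buffer.append(ch)

    flush()                         # emit whatever is left at end of input
    return tags

print(tags_by_state_machine('foo bar "multiple word tag"'))
```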

David Zaslavsky
1. In every language I know that has a built-in split method/function, it uses regexes. 2. Your split example only works if you already know how many tokens there should be, and only the last token has embedded spaces. 3. Three hundred characters is nowhere near enough. :-/
Alan Moore
A: 

General regex that will work with any match->array function:

(?<=")[^"]+|\w+


(If more than just alphanumeric and quotes are allowed, using \S+ instead of \w+ might make sense.)
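
A quick way to try the pattern is Python's `re.findall`, used here as a stand-in for any match->array function. The second input is a made-up edge case: text that follows a closing quote is also "preceded by a quote", so the lookbehind picks it up, leading space included.

```python
import re

TAG = re.compile(r'(?<=")[^"]+|\w+')

# The example from the question.
simple = TAG.findall('foo bar "multiple word tag"')
print(simple)

# A quoted tag followed by a plain one: the lookbehind also fires
# right after the closing quote.
edge = TAG.findall('"a b" c')
print(edge)
```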


Ruby example:

myarray = mystring.scan(/(?<=\")[^\"]+|\w+/)

(untested)

Peter Boughton
This seems to work nicely.
No, this code has a bug. The quotes around the multiple word tags will stay in the string. See my answer for a solution (using grouping).
eelco
I've added escapes to the quotes - is that what you meant?
Peter Boughton
Looks like SO ate your escapes. But what eelco means is that you should use groups to capture the text *inside* the quotes, and only add that to the results.
Alan Moore
Oh, it doesn't exclude the quotes... that's easy to fix - but not sure I'd use grouping - lookarounds work.
Peter Boughton
A: 

Here we go (Perl style):

^(?:"([^"]*?)"|(\S+?)|\s*?)*$

Explanation:

^                  // from the beginning
 (?:               // non-capturing group of three alternatives
    "([^"]*?)"     // capture "tag"
 |
    (\S+?)         // capture tag
 |
    \s*?           // ignore whitespace
 )*
$                  // until the end of the line
edgar.holleis
+6  A: 

In Ruby

scan(/\"([\w ]+)\"|(\w+)/).flatten.compact

E.g.

"foo bar \"multiple words\" party_like_1999".scan(/\"([\w ]+)\"|(\w+)/).flatten.compact
=> ["foo", "bar", "multiple words", "party_like_1999"]
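
For comparison, roughly the same thing in Python, as a sketch: `re.findall` returns one tuple per match when the pattern has two groups, so picking the non-empty element plays the role of `flatten.compact`. The function name `scan_tags` is made up.

```python
import re

TAG = re.compile(r'"([\w ]+)"|(\w+)')

def scan_tags(text):
    # Each match is a (quoted, bare) tuple; exactly one of the two is non-empty.
    return [quoted or bare for quoted, bare in TAG.findall(text)]

print(scan_tags('foo bar "multiple words" party_like_1999'))
```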
eelco
This works perfectly.
This solution won't let you use quotes inside a tag, though. For that, you'll need a quote-escaping mechanism. See http://stackoverflow.com/questions/56554/
James A. Rosen
I just did this in Java. Used Matcher.find and Matcher.group to loop over a String as scan does above. Regex worked perfectly, thanks!
Darren Greaves