I would like to use the .NET Regex.Split method to split this input string into an array. It must split on whitespace unless the whitespace is enclosed in quotes.

Input: Here is "my string"    it has "six  matches"

Expected output:

  1. Here
  2. is
  3. my string
  4. it
  5. has
  6. six  matches

What pattern do I need? Also do I need to specify any RegexOptions?

A: 

EDIT: Sorry for my previous post, this is obviously possible.

To handle all the non-alphanumeric characters you need something like this:

MatchCollection matchCollection = Regex.Matches(input, @"(?<match>[^""\s]+)|\""(?<match>[^""]*)""");
foreach (Match match in matchCollection)
{
    yield return match.Groups["match"].Value;
}

You can make the foreach smarter if you are using a .NET version later than 2.0.
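As a rough sketch of what a "smarter" loop could look like, assuming LINQ is available (.NET 3.5 or later): the class name QuotedTokenizer and the method name Tokenize are made up for illustration, and the pattern is the same one shown above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class QuotedTokenizer
{
    // Same pattern as above: either a run of characters that are neither
    // quotes nor whitespace, or a quoted run whose contents (without the
    // quotes) are captured in the same named group "match".
    private static readonly Regex TokenPattern =
        new Regex(@"(?<match>[^""\s]+)|\""(?<match>[^""]*)""");

    public static List<string> Tokenize(string input)
    {
        // MatchCollection is non-generic, so Cast<Match>() is needed
        // before the LINQ operators can be applied.
        return TokenPattern.Matches(input)
            .Cast<Match>()
            .Select(m => m.Groups["match"].Value)
            .ToList();
    }
}
```

Calling QuotedTokenizer.Tokenize on the input from the question should give the six tokens listed there, with the inner whitespace of the quoted parts preserved.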

Grzenio
+2  A: 

Shaun,

I believe the following regex should do it:

(?<=")\w[\w\s]*(?=")|\w+

Regards,
Lieven

Lieven
Thanks for the answer but this does not appear to take the quotes into account.
Shaun Bowe
It does here. What gives?
Lieven
Sorry, you're right. My mistake.
Lieven
Replace the \w+ with the regex from Bartek and you're ready to go.
Lieven
That regex doesn't drop the " though.
Lieven
I need a regexp for the JavaScript split() function that splits words on whitespace, except for those in quotes. I couldn't use the one you wrote. Do you know how to write one in JavaScript?
weng
+4  A: 

This regex will split based on the case you have given above, although it does not strip the quotes or extra spaces, so you may want to do some post-processing on your strings. It should correctly keep quoted strings together, though.

"[^"]+"|\s?\w+?\s
John Conrad
Thanks for the answer. This is very close. Close enough that I will use it for now. I will leave the question open for a day or so to see if there is a more complete answer. Otherwise I will accept this.
Shaun Bowe
"([^"]+)"|\s?(\w+?)\s will return the strings with the quotes stripped (in the capture groups).
f3lix
+1  A: 

With a little bit of messiness, regular languages can keep track of even/odd counting of quotes, but if your data can include escaped quotes (\") then you're in real trouble producing or comprehending a regular expression that will handle that correctly.
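If escaped quotes are a possibility, one alternative to a regex is a small hand-written scanner. The following is only a sketch under stated assumptions: the escape sequence is a backslash directly followed by a quote, quotes are dropped from the output, and the names EscapeAwareSplitter and Split are invented for this example.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; not taken from any library.
public static class EscapeAwareSplitter
{
    public static List<string> Split(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;  // true between an opening and a closing quote
        bool hasToken = false;  // lets an empty quoted token "" still count as a token

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            if (c == '\\' && i + 1 < input.Length && input[i + 1] == '"')
            {
                // Escaped quote: keep a literal quote and skip the backslash.
                current.Append('"');
                hasToken = true;
                i++;
            }
            else if (c == '"')
            {
                // Unescaped quote toggles quoted mode; the quote itself is dropped.
                inQuotes = !inQuotes;
                hasToken = true;
            }
            else if (char.IsWhiteSpace(c) && !inQuotes)
            {
                // Whitespace outside quotes ends the current token, if there is one.
                if (hasToken)
                {
                    tokens.Add(current.ToString());
                    current.Clear();
                    hasToken = false;
                }
            }
            else
            {
                current.Append(c);
                hasToken = true;
            }
        }

        // Flush the last token (an unterminated quote is treated as running to the end).
        if (hasToken) tokens.Add(current.ToString());
        return tokens;
    }
}
```

With this sketch, the input say "a \"quoted\" word" now should come back as three tokens, the middle one being a "quoted" word.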

Liudvikas Bukys
+10  A: 

No options required

Regex:

\w+|"[\w\s]*"

C#:

Regex regex = new Regex(@"\w+|""[\w\s]*""");

Or if you need to exclude " characters:

    Regex
        .Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""")
        .Cast<Match>()
        .Select(m => m.Groups["match"].Value)
        .ToList()
        .ForEach(s => Console.WriteLine(s));
Bartek Szabat
VERY CLOSE! Now all I need is to preserve the whitespace in the matches.
Shaun Bowe
+1 for using named group to exclude quotes _transparently_.
Anton
A: 

Take a look at LSteinle's "Split Function that Supports Text Qualifiers" over at CodeProject.

Here is the snippet from his project that you’re interested in.

using System.Text.RegularExpressions;

public string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
{
    string _Statement = String.Format("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))", 
                        Regex.Escape(delimiter), Regex.Escape(qualifier));

    RegexOptions _Options = RegexOptions.Compiled | RegexOptions.Multiline;
    if (ignoreCase) _Options = _Options | RegexOptions.IgnoreCase;

    Regex _Expression = new Regex(_Statement, _Options);
    return _Expression.Split(expression);
}

Just watch out for calling this in a loop, as it creates and compiles the Regex every time you call it. If you need to call it more than a handful of times, I would look at creating a Regex cache of some kind.
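One possible shape for such a cache, as a sketch only: it assumes .NET 4.0 or later for ConcurrentDictionary and Tuple, and the names QualifiedSplitter, Cache and Build are made up here rather than taken from the CodeProject article. The pattern is the one from the snippet above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the snippet above; names are illustrative.
public static class QualifiedSplitter
{
    // One compiled Regex per (delimiter, qualifier, ignoreCase) combination.
    private static readonly ConcurrentDictionary<Tuple<string, string, bool>, Regex> Cache =
        new ConcurrentDictionary<Tuple<string, string, bool>, Regex>();

    public static string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
    {
        var key = Tuple.Create(delimiter, qualifier, ignoreCase);
        // The Regex is built on the first call for a given key and reused afterwards.
        Regex regex = Cache.GetOrAdd(key, k => Build(k.Item1, k.Item2, k.Item3));
        return regex.Split(expression);
    }

    private static Regex Build(string delimiter, string qualifier, bool ignoreCase)
    {
        // Matches the delimiter only when it is followed by an even number of
        // qualifiers, i.e. when it sits outside a qualified (quoted) section.
        string pattern = String.Format("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
            Regex.Escape(delimiter), Regex.Escape(qualifier));

        RegexOptions options = RegexOptions.Compiled | RegexOptions.Multiline;
        if (ignoreCase) options |= RegexOptions.IgnoreCase;

        return new Regex(pattern, options);
    }
}
```

Note that this approach keeps the qualifier characters in the results and produces empty entries when delimiters are repeated, so callers may still want to trim the quotes and drop empty strings afterwards.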

Adam L
+4  A: 

Lieven's solution gets most of the way there, and as he states in his comments, it's just a matter of changing the ending to Bartek's solution. The end result is the following working regex:

(?<=")\w[\w\s]*(?=")|\w+|"[\w\s]*"

Input: Here is "my string" it has "six matches"

Output:

  1. Here
  2. is
  3. "my string"
  4. it
  5. has
  6. "six matches"

Unfortunately, it includes the quotes. If you instead use the following:

(("((?<token>.*?)(?<!\\)")|(?<token>[\w]+))(\s)*)

And explicitly capture the "token" matches as follows:

    RegexOptions options = RegexOptions.None;
    Regex regex = new Regex( @"((""((?<token>.*?)(?<!\\)"")|(?<token>[\w]+))(\s)*)", options );
    string input = @"   Here is ""my string"" it has   "" six  matches""   ";
    var result = (from Match m in regex.Matches( input ) 
                  where m.Groups[ "token" ].Success
                  select m.Groups[ "token" ].Value).ToList();

    for ( int i = 0; i < result.Count; i++ )
    {
        Debug.WriteLine( string.Format( "Token[{0}]: '{1}'", i, result[ i ] ) );
    }

Debug output:

Token[0]: 'Here'
Token[1]: 'is'
Token[2]: 'my string'
Token[3]: 'it'
Token[4]: 'has'
Token[5]: ' six  matches'
Timothy Walters
A: 

If you'd like to take a look at a general solution to this problem in the form of a free, open-source JavaScript object, you can visit http://splitterjsobj.sourceforge.net/ for a live demo (and download). The object has the following features:

  • Pairs of user-defined quote characters can be used to escape the delimiter (prevent a split inside quotes). The quotes can be escaped with a user-defined escape char, and/or by "double quote escape." The escape char can be escaped (with itself). In one of the 5 output arrays (properties of the object), output is unescaped. (For example, if the escape char = /, "a///"b" is unescaped as a/"b)
  • Split on an array of delimiters; parse a file in one call. (The output arrays will be nested.)
  • All escape sequences recognized by JavaScript can be evaluated during the split process and/or in a preprocess.
  • Callback functionality
  • Cross-browser consistency

The object is also available as a jQuery plugin, but as a new user at this site I can only include one link in this message.

Brian W