Hello all. I'm new to regular expressions and would appreciate your help. I'm trying to put together an expression that will split the example string using all spaces that are not surrounded by single or double quotes. My last attempt looks like this: (?!") and isn't quite working. It's splitting on the space before the quote.

Example input:

This is a string that "will be" highlighted when your 'regular expression' matches something.

Desired output:

This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something.

Note that "will be" and 'regular expression' retain the space between the words.

TIA,
Carl

A: 

It'll probably be easier to search the string, grabbing each part, vs. split it.

The reason is that you can have it split at the spaces before and after "will be", but I can't think of any way to tell a split to ignore the space inside the quotes.

(Java sketch)

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";

// Match a double-quoted group (which may contain \") or a run of non-space characters
Pattern pattern = Pattern.compile("\"(?:\\\\\"|[^\"])+\"|[^ ]+");
List<String> tokens = new ArrayList<>();

// find() resumes after the previous match, so each call yields the next "word"
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    tokens.add(matcher.group());
}


Also, capturing single quotes could lead to issues:

"Foo's Bar 'n Grill"

//=>

"Foo"
"s Bar "
"n"
"Grill"
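
To see the apostrophe problem concretely, here is a minimal sketch. It assumes a pattern that accepts both single- and double-quoted groups; the class name `ApostropheDemo` and the method `tokenize` are made up for illustration. Note that this version keeps the quote characters in each match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, only meant to illustrate the apostrophe pitfall.
final class ApostropheDemo {
    // A bare word, a "double-quoted" group, or a 'single-quoted' group.
    private static final Pattern TOKEN =
            Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group()); // quotes are kept in the token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The two apostrophes are paired up as if they were quotes.
        for (String token : tokenize("Foo's Bar 'n Grill")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

The apostrophe in Foo's and the one in 'n get treated as an opening and closing quote, so "s Bar " comes back as a single token.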
Jonathan Lonowski
Your solution doesn't handle single-quoted strings, which are part of Carl's example.
Jan Goyvaerts
A: 

String.split() is not helpful here because there is no way to distinguish between spaces within quotes (don't split) and those outside (split). Matcher.lookingAt() is probably what you need:

String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
str = str + " "; // add trailing space
int len = str.length();
Matcher m = Pattern.compile("((\"[^\"]+?\")|('[^']+?')|([^\\s]+?))\\s++").matcher(str);

for (int i = 0; i < len; i++)
{
    m.region(i, len);

    if (m.lookingAt())
    {
        String s = m.group(1);

        if ((s.startsWith("\"") && s.endsWith("\"")) ||
            (s.startsWith("'") && s.endsWith("'")))
        {
            s = s.substring(1, s.length() - 1);
        }

        System.out.println(i + ": \"" + s + "\"");
        i += (m.group(0).length() - 1);
    }
}

which produces the following output:

0: "This"
5: "is"
8: "a"
10: "string"
17: "that"
22: "will be"
32: "highlighted"
44: "when"
49: "your"
54: "regular expression"
75: "matches"
83: "something."
Zach Scrivena
A: 

I'm reasonably certain this is not possible using regular expressions alone. Checking whether something is contained inside some other tag is a parsing operation. This seems like the same problem as trying to parse XML with a regex -- it can't be done correctly. You may be able to get your desired outcome by repeatedly applying a non-greedy, non-global regex that matches the quoted strings, then once you can't find anything else, split it at the spaces... that has a number of problems, including keeping track of the original order of all the substrings. Your best bet is to just write a really simple function that iterates over the string and pulls out the tokens you want.
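
A simple function of that kind might look like the sketch below. It is only one possible design, with assumptions of my own: the names `QuoteAwareTokenizer` and `tokenize` are hypothetical, a quote only opens a group when it is the first character of a token (so an apostrophe inside a word is kept as-is), and escaped quotes are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written tokenizer; no regular expressions involved.
final class QuoteAwareTokenizer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char openQuote = 0;           // 0 means we are not inside a quoted group
        boolean tokenStarted = false; // true once the current token has begun

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (openQuote != 0) {
                if (c == openQuote) {
                    openQuote = 0;      // closing quote: drop it
                } else {
                    current.append(c);  // keep spaces inside the quotes
                }
            } else if (!tokenStarted && (c == '"' || c == '\'')) {
                openQuote = c;          // opening quote at the start of a token
                tokenStarted = true;
            } else if (Character.isWhitespace(c)) {
                if (tokenStarted) {     // whitespace outside quotes ends the token
                    tokens.add(current.toString());
                    current.setLength(0);
                    tokenStarted = false;
                }
            } else {
                current.append(c);
                tokenStarted = true;
            }
        }
        if (tokenStarted) {
            tokens.add(current.toString()); // flush the last token
        }
        return tokens;
    }
}
```

Because tokens are collected in a single left-to-right pass, the original order of the substrings is preserved without any extra bookkeeping.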

rmeador
It's possible with a regex; see some of the samples I linked to. There are a few variations on this, and I've seen several similar questions on SO that address this via regular expressions.
Jay
Knowing when not to use a regex is more helpful than being able to create a (?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))
Rene
+3  A: 

There are several questions on StackOverflow that cover this same question in various contexts using regular expressions. For instance:

UPDATE: Sample regex to handle single and double quoted strings. Ref: How can I split on a string except when inside quotes?

m/('.*?'|".*?"|\S+)/g

Tested this with a quick Perl snippet and the output was as reproduced below. Also works for empty strings or whitespace-only strings if they are between quotes (not sure if that's desired or not).

This
is
a
string
that
"will be"
highlighted
when
your
'regular expression'
matches
something.

Note that this does include the quote characters themselves in the matched values, though you can remove that with a string replace, or modify the regex to not include them. I'll leave that as an exercise for the reader or another poster for now, as 2am is way too late to be messing with regular expressions anymore ;)

Jay
I think your regex allows mismatched quotes, e.g. "will be' and 'regular expressions".
Zach Scrivena
@Zach - you're right, it does...updated it to fix that just in case
Jay
A: 

If you want to allow escaped quotes inside the string, you can use something like this:

(?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))

Quoted strings will be group 2, single unquoted words will be group 3.

You can try it on various strings here: http://www.fileformat.info/tool/regex.htm or http://gskinner.com/RegExr/
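
For Java, the same pattern could be used roughly as follows. This is a sketch, not a tested library routine: the class name `EscapedQuoteTokens` and the method `tokenize` are hypothetical, and the backslashes are doubled only because the pattern is written as a Java string literal. Escaped quotes stay in the token exactly as written, backslash included.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern above.
final class EscapedQuoteTokens {
    private static final Pattern TOKEN = Pattern.compile(
            "(?:(['\"])(.*?)(?<!\\\\)(?>\\\\\\\\)*\\1|([^\\s]+))");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Group 2 is set for quoted strings, group 3 for unquoted words.
            tokens.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return tokens;
    }
}
```

Checking group 2 against null (rather than for emptiness) means an empty quoted string still produces an empty token.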

mcrumley
+13  A: 

I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:

[^\s"']+|"([^"]*)"|'([^']*)'

I added the capturing groups because you don't want the quotes in the list.

This Java code builds the list, adding the capturing group if it matched to exclude the quotes, and adding the overall regex match if the capturing group didn't match (an unquoted word was matched).

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    if (regexMatcher.group(1) != null) {
        // Add double-quoted string without the quotes
        matchList.add(regexMatcher.group(1));
    } else if (regexMatcher.group(2) != null) {
        // Add single-quoted string without the quotes
        matchList.add(regexMatcher.group(2));
    } else {
        // Add unquoted word
        matchList.add(regexMatcher.group());
    }
}

If you don't mind having the quotes in the returned list, you can use much simpler code:

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group());
}
Jan Goyvaerts
Jan, thanks for your response. BTW, I'm a big fan of EditPad.
carlsz
+1  A: 
(?<!\G".{0,99999})\s|(?<=\G".{0,99999}")\s

This will match the spaces not surrounded by double quotes. I have to use min,max {0,99999} because Java doesn't support * and + in lookbehind.
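
Since this pattern matches the separators rather than the tokens, it can be handed to a split. A minimal sketch, assuming the hypothetical names `LookbehindSplit` and `split`; the double quotes stay in the results, and single quotes get no special treatment:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical wrapper; the pattern is the one above as a Java string literal.
final class LookbehindSplit {
    private static final Pattern SPACE_OUTSIDE_DOUBLE_QUOTES = Pattern.compile(
            "(?<!\\G\".{0,99999})\\s|(?<=\\G\".{0,99999}\")\\s");

    static List<String> split(String input) {
        // \G anchors at the end of the previous separator, so each lookbehind
        // only inspects the token currently being scanned.
        return Arrays.asList(SPACE_OUTSIDE_DOUBLE_QUOTES.split(input));
    }

    public static void main(String[] args) {
        for (String part : split("a \"b c\" d")) {
            System.out.println("[" + part + "]");
        }
    }
}
```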

Marcus Andromeda
+1 for an approach I've never seen before, and a cool approach it is! If it weren't for those thrice-damned quantifiers, I'd even be tempted to call it elegant. :P
Alan Moore