tags:

views:

463

answers:

6

I am looking for a regular exression matching spaces only if thos spaces are not enclosed in double quotes ("). For example, in

Mary had "a little lamb"

it should match the first an the second space, but not the others.

I want to split the string only at the spaces which are not in the double quotes, and not at the quotes.

I am using C++ with the Qt toolkit and wanted to use QString::split(QRegExp). QString is very similar to std::string and QRegExp are basically POSIX regex encapsulated in a class. If there exist such a regex, the split would be trivial.

Examples:

Mary had "a little lamb"     =>   Mary,had,"a little lamb"
1" 2 "3                      =>   1" 2 "3 (no splitting at ")
abc def="g h i" "j k" = 12   =>   abc,def="g h i","j k",=,12

Sorry for the edits, I was very imprecise when I asked the question first. Hope it is somewhat more clear now.

+1  A: 

The question is answered here: Using regex to replace all spaces NOT in quotes in Ruby

Ron Gejman
No it isn't. That question is about *replacing*, this one's about *splitting*. All the solutions offered for that question involve matching spaces OR quoted sequences in alternation, but that won't work here.
Alan Moore
They are related questions. If you can match them you can (usually) replace them.
Ron Gejman
+3  A: 

What should happen to "a" b "c" ?

Note that in the substring " b " the spaces are between quotes.

-- edit --

I assume a space is "between quotes" if it is preceded and followed by an odd number of standard quotation marks (i.e. U+0022, I'll ignore those funny Unicode “quotes”).

That means you need the following regex: ^[^"]*("[^"]*"[^"]*)*"[^"]* [^"]*"[^"]*("[^"]*"[^"]*)*$

("[^"]*"[^"]*) represents a pair of quotes. ("[^"]*"[^"]*)* is an even amount of quotes, ("[^"]"[^"]*)*" an odd amount. Then there's the actual quoted string part, followed by another odd number of quotes. ^$ anchors are needed because you need to count every quote from the beginning of the string. This answers the " b " substring problem above by never looking at substrings. The price is that every character in your input must be matched against the entire string, which turns this into an O(N*N) split operation.

The reason why you can do this in a regex is because there is a finite amount of memory needed. Effectively just one bit; "have I seen an odd or even number of quotes so far?". You don't actually have to match up individual "" pairs.

This is not the only interpretation possible, though. If you do include “funny Unicode quotes” which should be paired, you also need to deal with ““double quoted”” strings. This in turn means you need a count of open , which means you need infinite storage, which in turns means it's no longer a regular language, which means you can't use a regex. QED.

Anyway, even if it was possible, you still would want a proper parser. The O(N*N) behavior to count the number of quotes preceding each character just isn't funny. If you already know there are X quotes preceding Str[N], it should be an O(1) operation to determine how many quotes precede Str[N+1], not O(N). The possible answers are after all just X or X+1 !

MSalters
That’s a question and not an answer. Use a comment.
Gumbo
It's an answer with a question mark ;) The problem is that he's using the wrong tool (regex instead of a stack-based parser) for his problem. And there's no "close reason: problem can't be solved with regex"
MSalters
The reason I ask is because I wanted to avoid using a parser. I wanted the "cheap" solution. If there is no solution using regex, provide a mathematical proof and I will accept it as answer :-)Doesn't even need to be strict and rigor :-)
drhirsch
Let's give it a shot. Def: A space is "between quotes" if it is preceded and followed by an odd number of standard quotation marks (i.e. U+0022, I'll ignore those funny Unicode “quotes”). That means you need the following regex: ^.*(".*".*)*".* .*".*(".*".*)*$Hmm, I'd better turn that into an answer :P
MSalters
Great idea of odd and even counting :-) But shouldn't there be a * after the second closing ] ? And did you try to match the spaces between the quotes? I did want the others. But anyway, thats a starting point :-)
drhirsch
I indeed missed the *, fixed.
MSalters
A: 

Simplest regex-solution: match whole spaces AND quotes. Filter quotes later

"[^"]*"|\s
maykeye
+1  A: 

If the quoting in the strings is simple (like your examples), you can use alternation. This regex first hunts for a simple quoted string; failing that it finds spaces.

/(\"[^\"]*\"| +)/

In Perl, if you use grouping in the regex when calling split(), the function returns not only the elements but also the captured groups (in this case, our delimiter). If you then filter out the blank and spaces-only delimiters, you will get the desired list of elements. I don't know whether a similar strategy would work in C++, but the following Perl code does work:

use strict;
use warnings;
while (<DATA>){
    chomp;
    my @elements = split /(\"[^\"]*\"| +)/, $_;
    @elements = grep {length and /[^ ]/} @elements;
    # Do stuff with @elements
}

__DATA__
Mary had "a little lamb"
1" 2 "3
abc def="g h i" "j k" = 12
FM
+1  A: 

MSalters pushed me on the right track. The problem with his answer that the regex he gives always matches the whole string and so is unsuitable for split(), but this can partly redeemed by a lookahead match. Assuming that the quotes are always paired (they are indeed), I can split on every space which is followed by an even number of quotes.

The regex without C escapes and in single quotes looks like

' (?=[^"]*("[^"]*"[^"]*)*$)'

In the source it finally looked like (using Qt and C++)

QString buf("Mary had \"a little lamb\""); // string we want to split
QStringList splitted = buf.split( QRegExp(" (?=[^\"]*(\"[^\"]*\"[^\"]*)*$)") );

Simple, eh?

For the performance, the strings are parsed once at the start of the program, they are a few dozen and they are less than hundred chars. I will test its runtime with long strings, just to be sure nothing bad happens ;-)

drhirsch
+5  A: 

(I know you just posted almost exactly the same answer yourself, but I can't bear to just throw all this away. :-/)

If it's possible to solve your problem with a regex split operation, the regex will have to match even numbers of quotation marks, as MSalters said. However, a split regex should match only the spaces you're splitting on, so the rest of the work has to be done in a lookahead. Here's what I would use:

" +(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

If the text is well formed, a lookahead for an even number of quotes is sufficient to determine that the just-matched space is not inside a quoted sequence. That is, lookbehinds aren't necessary, which is good because QRegExp doesn't seem to support them. Escaped quotes can be accommodated too, but the regex becomes quite a bit larger and uglier. But if you can't be sure the text is well formed, it's extremely unlikely you'll be able to solve your problem with split().

By the way, QRegExp does not implement POSIX regular expressions--if it did, it wouldn't support lookaheads OR lookbehinds. Instead, it falls into the loosely-defined category of Perl-compatible regex flavors.

Alan Moore
Well, I wrote _basically_ POSIX ;-) When I searched for lookbehinds, I noticed that there was something missing. But shortly after, I noticed I didn't even need them, assuming always paired quotes. Maybe I should file a bug at Qt. Anyway, +1 for the improvment of the regex.
drhirsch
I realize you were using the term "POSIX" loosely; I was merely pointing out others might not. I had to look up QRegExp to be sure it *wasn't* POSIX-standard (or, more specifically, that I could recommend a lookahead-based solution).
Alan Moore