I would like to replace | with OR, but only outside quoted terms, e.g.:

"this | that" | "the | other" -> "this | that" OR "the | other"

Yes, I could split on spaces or quotes, get an array, iterate through it, and reconstruct the string, but that seems ... inelegant. So perhaps there's a regex way to do this by counting the "s preceding each |: an odd count means the | is quoted, and an even count means it is unquoted. (Note: processing doesn't start until there is an even number of " if there is at least one ".)

+4  A: 

Regexes do not count. That's what parsers are for.

chaos
Yes, this problem screams for a state machine.
Sean Cavanagh
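
To make the state-machine suggestion concrete, here is a minimal sketch in JavaScript (the language the asker later says they are using). The function name `replaceUnquotedPipes` is made up for illustration, and the sketch assumes quotes are never escaped or nested:

```javascript
// Minimal two-state scanner: we are either inside a quoted term or outside one.
// Assumption: a double quote always toggles the state (no escaped quotes).
function replaceUnquotedPipes(input) {
  let inQuotes = false;
  let out = '';
  for (const ch of input) {
    if (ch === '"') {
      inQuotes = !inQuotes; // entering or leaving a quoted term
      out += ch;
    } else if (ch === '|' && !inQuotes) {
      out += 'OR';          // pipe outside quotes becomes OR
    } else {
      out += ch;            // everything else is copied unchanged
    }
  }
  return out;
}

console.log(replaceUnquotedPipes('"this | that" | "the | other"'));
```

The scan is a single pass with one boolean of state, which is why this kind of problem is often handed to a small parser rather than a regex.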
+3  A: 

You might find the Perl FAQ on this issue relevant.

#!/usr/bin/perl

use strict;
use warnings;

my $x = qq{"this | that" | "the | other"};
print join('" OR "', split /" \| "/, $x), "\n";
Sinan Ünür
A: 

Perhaps you're looking for something like this:

(?<=^([^"]*"[^"]*")+[^"|]*)\|
Jeremy Stein
A: 

You don't need to count, because you don't nest quotes. This will do:

#!/usr/bin/perl

my $str = '" this \" | that" | "the | other" | "still | something | else"';
print "$str\n";

while($str =~ /^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/) {
        $str =~ s/^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/$1OR/;
}

print "$str\n";

Now, let's explain that expression.

^  -- means you'll always match everything from the beginning of the string, otherwise
      the match might start inside a quote, and break everything

(...)\|   -- this means you'll match a certain pattern, followed by a |, which appears
             escaped here; so when you replace it with $1OR, you keep everything, but
             replace the |.

(?:...)*  -- This is a non-capturing group, which can be repeated multiple times; we
             use a group here so we can repeat a set of alternative patterns.

[^"|\\]*  -- This is the first pattern. Anything that isn't a pipe, an escape character
             or a quote.

\\.       -- This is the second pattern. Basically, an escape character and anything
             that follows it.

"(?:...)*" -- This is the third pattern. Open quote, followed by another
              non-capturing group repeated multiple times, followed by a closing
              quote.

[^\\"]    -- This is the first pattern in the second non-capturing group. It's anything
             except an escape character or a quote.

\\.       -- This is the second pattern in the second non-capturing group. It's an
             escape character and whatever follows it.

The result is as follows:

" this \" | that" | "the | other" | "still | something | else"
" this \" | that" OR "the | other" OR "still | something | else"
Daniel
A: 

Thanks everyone. Apologies for neglecting to mention that this is in JavaScript, that terms don't have to be quoted, and that there can be any number of quoted/unquoted terms, e.g.:

"this | that" | "the | other" | yet | another  -> "this | that" OR "the | other" OR yet OR another

Daniel, it seems that's in the ballpark, i.e. basically a matching/massaging loop. Thanks for the detailed explanation. In JS, it looks like a split, a forEach loop over the array of terms that pushes each term (after changing a | term to OR) into a new array, and a join.
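
A rough sketch of that split/loop/join idea, under the assumption that quotes are balanced and never escaped. It uses `map` rather than `forEach` plus `push`, and the name `replaceBySplitting` is invented for this example:

```javascript
// Split on quoted terms. Because the pattern has a capturing group,
// String.prototype.split keeps the quoted terms in the result array,
// so they land at the odd indices and the unquoted text at the even ones.
function replaceBySplitting(str) {
  const parts = str.split(/("[^"]*")/);
  return parts
    .map((part, i) => (i % 2 === 1 ? part : part.replace(/\|/g, 'OR')))
    .join('');
}

console.log(replaceBySplitting('"this | that" | "the | other" | yet | another'));
```

Only the unquoted pieces are touched, so pipes inside quoted terms pass through unchanged.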

A: 

It's true that regexes can't count, but they can be used to determine whether there's an odd or even number of something. The trick in this case is to examine the quotation marks after the pipe, not before it.

str = str.replace(/\|(?=(?:(?:[^"]*"){2})*[^"]*$)/g, "OR");

Breaking that down, (?:[^"]*"){2} matches the next pair of quotes if there is one, along with the intervening non-quotes. After you've done that as many times as possible (which might be zero), [^"]*$ consumes any remaining non-quotes until the end of the string.

Of course, this assumes the text is well-formed. It doesn't address the problem of escaped quotes either, but it can if you need it to.

Alan Moore
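
As an illustration of the lookahead above, and of how it might be extended to escaped quotes, here is a hedged sketch. The first function wraps the regex from the answer as written. The second is an assumption about what "it can if you need it to" could look like: it treats a backslash plus the following character as one unit, so `\"` is not counted as a quote. Both function names are hypothetical.

```javascript
// Regex from the answer: replace a pipe only if an even number of
// double quotes follows it up to the end of the string.
function replacePipesLookahead(str) {
  return str.replace(/\|(?=(?:(?:[^"]*"){2})*[^"]*$)/g, 'OR');
}

// Assumed variant: (?:[^"\\]|\\.) consumes either an ordinary character or a
// backslash escape, so escaped quotes are skipped when counting.
function replacePipesLookaheadEscaped(str) {
  return str.replace(
    /\|(?=(?:(?:(?:[^"\\]|\\.)*"){2})*(?:[^"\\]|\\.)*$)/g,
    'OR'
  );
}

console.log(replacePipesLookahead('"this | that" | "the | other" | yet | another'));
console.log(replacePipesLookaheadEscaped('"a | \\" b" | c'));
```

Like the original, the variant still assumes the unescaped quotes in the text are balanced.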
A: 

Another approach (similar to Alan M's working answer):

str = str.replace(/(".+?"|\w+)\s*\|\s*/g, '$1 OR ');

The part inside the first group (spaced for readability):

".+?"  |  \w+

... basically means something quoted, or a word. The remainder means that it is followed by a "|" wrapped in optional whitespace. The replacement is that first part ("$1" means the first group) followed by " OR ".

epost
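
A small sketch of how this pattern behaves in practice. One caveat worth checking against your own input: when the last term is quoted and contains a pipe, no "|" follows the closing quote, so the quoted alternative fails there and the `\w+` alternative can then match a word inside the quotes. The second function is not from the answer; it is a related variant that consumes each quoted term whole, and both function names are made up for this example.

```javascript
// The pattern from the answer: a quoted term or a word, followed by a pipe.
function replaceAfterTerm(str) {
  return str.replace(/(".+?"|\w+)\s*\|\s*/g, '$1 OR ');
}

// Variant (an assumption, not the answer's method): match either a whole
// quoted term, which is returned unchanged, or a bare pipe, which becomes OR.
// Assumes balanced, unescaped quotes.
function replaceSkippingQuotes(str) {
  return str.replace(/"[^"]*"|\|/g, (m) => (m === '|' ? 'OR' : m));
}

console.log(replaceAfterTerm('"this | that" | "the | other" | yet | another'));
console.log(replaceAfterTerm('"this | that" | "the | other"'));
console.log(replaceSkippingQuotes('"this | that" | "the | other"'));
```

With the asker's second example (unquoted terms at the end) the original pattern gives the expected result; the variant only matters when a quoted term comes last.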
A: 

@Alan M, works nicely; escaping isn't necessary due to the sparseness of SQLite's FTS capabilities.

@epost, accepted your solution for its brevity and elegance, thanks. It merely needed to be put in a more general form for Unicode etc.:

(".+?"|[^\"\s]+)\s*\|\s*