I am running a split in JavaScript with /\s+(AND|OR)(?=\s+")\s+/ on

"email" IS NOT NULL AND "email" LIKE '%gmail.com' OR "email" = '[email protected]'

Now, my understanding of regular expressions would lead me to expect obtaining the following array:

[0]: "email" IS NOT NULL
[1]: "email" LIKE '%gmail.com'
[2]: "email" = '[email protected]'

Note: I got rid of the delimiters for clarity.

However, I obtain

[0]: "email" IS NOT NULL
[1]:  AND
[2]: "email" LIKE '%gmail.com'
[3]:  OR
[4]: "email" = '[email protected]'

when running on Firefox 3.6.8, Chrome 5.0.375.126 and Safari 5.0.1 on OS X 10.6.4.

However, when I tried an up-to-date IE8 (8.0.6) with default settings, I obtained what I was expecting at first. PHP 5.2.10 with preg_split also splits it this way.

My guess is that for once the 'good' browsers got it wrong but I'd like more opinions.

Edit: The example I gave here with emails is a naive example. Basically I don't know what each member can be. "xyz" = '1' AND "zyx" = 'test AND toast' is another possible input string.

What I know of the structure is that the whole string will have the following pattern:

"<attribute>" <operator> '<value>'( (AND|OR) "<attribute>" <operator> '<value>')*

Note: spaces actually represent \s+

+1  A: 

Try splitting on /\b(?:AND|OR)\b/, and trim the resulting parts.

Be aware that boolean operators have precedence rules and you cannot just split on AND and OR without losing meaning. Also, boolean expressions can (in theory) be enclosed in nested parentheses, which basically rules out regular expressions as a technology to parse them.

Tomalak
This wouldn't work, since you can very well conceive a case where a test string would contain ' AND '. The example I gave here with emails is a simple example. Basically I don't know what each member can be. "xyz" = '1' AND "zyx" = 'test AND toast' is another possible input.
Guillaume Bodi
@Guillaume: Exactly. This is why you must never parse a structured language with regular expressions. This applies to HTML the same way it applies to nested strings or Boolean expressions. It. does. not. work. Use (or write) an actual parser for this problem. Have a look at [PEG.js](http://pegjs.majda.cz/) to generate a JS-based parser from a grammar you define.
Tomalak
@Tomalak: Interesting library. I'll give it a try when I have some more time. However, for what I am doing it's probably a bit of an overkill. I am perfectly aware of the implications of mixing both AND and OR operators without parentheses, but this tool will be operated by a handful of competent staff. That's why we opted for this approach for simple conditions. For other purposes, we provide a text editor.
Guillaume Bodi
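For the simple, non-nested case discussed in these comments, a small hand-written scanner can stand in for a full parser. The following is only a sketch: `splitConditions` is a hypothetical helper (not from PEG.js or any library), and it assumes that quoted attributes and values never contain an escaped quote character. It skips over anything inside single or double quotes and cuts only on AND/OR keywords found outside them.

```javascript
// Hypothetical helper: split a condition string on AND/OR keywords that
// appear outside quoted text. Assumes quotes are never escaped inside
// attributes or values.
function splitConditions(input) {
  var parts = [];
  var current = "";
  var quote = null; // the quote character we are inside of, or null
  var i = 0;

  while (i < input.length) {
    var ch = input.charAt(i);

    if (quote === null) {
      // Outside quotes: check whether an operator starts at this position.
      var m = /^\s+(?:AND|OR)\s+/.exec(input.slice(i));
      if (m) {
        parts.push(current);
        current = "";
        i += m[0].length; // skip the operator and its surrounding whitespace
        continue;
      }
      if (ch === "'" || ch === '"') {
        quote = ch; // entering a quoted attribute or value
      }
    } else if (ch === quote) {
      quote = null; // leaving the quoted attribute or value
    }

    current += ch;
    i++;
  }

  parts.push(current);
  return parts;
}

var result = splitConditions("\"xyz\" = '1' AND \"zyx\" = 'test AND toast'");
```

Note that this returns only the conditions and discards the operators, so it says nothing about precedence between AND and OR, and it does not handle parentheses.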
+1  A: 

This will return the result you want:

var string = "\"email\" IS NOT NULL AND \"email\" LIKE '%gmail.com' OR \"email\" = '[email protected]'";
// (?:...) is a non-capturing group, so the operators are not added to the result
string.split(/\s+(?:AND|OR)\s+/);
jigfox
I updated my question to explain why it doesn't cut it.
Guillaume Bodi
A: 

It looks like Firefox and Chrome got it perfectly right. According to the ECMAScript 5 specification, section 15.5.4.14 (String.prototype.split):

If separator is a regular expression that contains capturing parentheses, then each time separator is matched the results (including any undefined results) of the capturing parentheses are spliced into the output array.

For example,

"A<B>bold</B>and<CODE>coded</CODE>".split(/<(\/)?([^<>]+)>/)

evaluates to the array

["A", undefined, "B", "bold", "/", "B", "and", undefined, "CODE", "coded", "/", "CODE", ""]

Pointer to the specs by Chris Leary of Mozilla.
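Applied to the regular expression from the question, this means the capturing group (AND|OR) is what puts the operators into the array. As a sketch, using the second input string from the question, switching to a non-capturing group (?:AND|OR) while keeping the lookahead should give the originally expected result in a spec-compliant engine:

```javascript
// Input taken from the edit to the question.
var input = "\"xyz\" = '1' AND \"zyx\" = 'test AND toast'";

// Capturing group: each matched operator is spliced into the output array.
var withCapture = input.split(/\s+(AND|OR)(?=\s+")\s+/);

// Non-capturing group: only the text between separators is returned.
// The lookahead (?=\s+") still requires a double quote after the operator,
// so the AND inside 'test AND toast' is not treated as a separator.
var withoutCapture = input.split(/\s+(?:AND|OR)(?=\s+")\s+/);

console.log(withCapture.length);    // 3
console.log(withoutCapture.length); // 2
```

The lookahead only protects values here because an operator is always followed by a double-quoted attribute; a value that itself contains AND followed by a double quote would still be split.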

Guillaume Bodi