I need to support exact phrases (enclosed in quotes) in an otherwise space-separated list of terms. Thus splitting the respective string by the space-character is not sufficient anymore.

Example:

input : 'foo bar "lorem ipsum" baz'
output: ['foo', 'bar', 'lorem ipsum', 'baz']

I wonder whether this could be achieved with a single RegEx, rather than performing complex parsing or split-and-rejoin operations.

Any help would be greatly appreciated!

A: 
'foo bar "lorem ipsum" baz'.match(/"[^"]*"|\w+/g);

Note that the bounding quotes are included in the results, though.

shyam
+1  A: 

You will find a good discussion on the subject in this question.

Christian Lescuyer
A: 

A simple regular expression will do, but it leaves the quotation marks in place, e.g.:

'foo bar "lorem ipsum" baz'.match(/("[^"]*")|([^\s"]+)/g)
output:   ['foo', 'bar', '"lorem ipsum"', 'baz']

edit: beaten to it by shyamsundar, sorry for the double answer

A Nony Mouse
A: 

How about:

output = input.match(/(".+?"|\w+)/g)

then do a pass on output to lose the quotes.

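Alternatively, keep the regex in a variable and call exec in a loop, since each call returns only one match:

re = /"(.+?)"|(\w+)/g
m = re.exec(input)

then on each match keep whichever capture (m[1] or m[2]) is not empty.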
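The capture-group variant can be written out as a complete loop. This is only a minimal sketch: the function name tokenize is hypothetical, and it assumes a quoted phrase never contains a double quote itself.

```javascript
// Hypothetical helper: collect terms, using capture groups to drop the quotes.
function tokenize(input) {
    var re = /"(.+?)"|(\w+)/g;
    var terms = [];
    var m;
    // With the g flag, each exec() call resumes from re.lastIndex
    // and returns null once there are no more matches.
    while ((m = re.exec(input)) !== null) {
        // m[1] is set for a quoted phrase, m[2] for a bare word.
        terms.push(m[1] !== undefined ? m[1] : m[2]);
    }
    return terms;
}

var terms = tokenize('foo bar "lorem ipsum" baz');
// terms is ['foo', 'bar', 'lorem ipsum', 'baz']
```

Because the quotes sit outside the first capture group, no second pass is needed to strip them.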

davidnicol
+4  A: 
var str = 'foo bar "lorem ipsum" baz';  
var results = str.match(/("[^"]+"|[^"\s]+)/g);

... returns the array you're looking for.
Note, however:

  • Bounding quotes are included, but can be removed by running replace(/^"([^"]+)"$/,"$1") on each result.
  • Spaces between the quotes will stay intact. So, if there are three spaces between lorem and ipsum, they'll be in the result. You can fix this by running replace(/\s+/g," ") on each result (the g flag is needed to collapse every run, not just the first).
  • If there's no closing " after ipsum (i.e. an incorrectly-quoted phrase) you'll end up with: ['foo', 'bar', 'lorem', 'ipsum', 'baz']
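The clean-up steps from the notes above could be combined into a single pass over the matches. This is a sketch only: the variable name cleaned is made up for illustration, and the `|| []` guard is an addition to cover match() returning null when nothing matches.

```javascript
var str = 'foo bar "lorem   ipsum" baz';
var cleaned = (str.match(/("[^"]+"|[^"\s]+)/g) || []).map(function (term) {
    return term
        .replace(/^"([^"]+)"$/, "$1") // strip the bounding quotes
        .replace(/\s+/g, " ");        // collapse every run of whitespace to one space
});
// cleaned is ['foo', 'bar', 'lorem ipsum', 'baz']
```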
yoz
The only problem with this is that all quotes are stripped - i.e. quote characters themselves are not searchable.
A: 

Try this:

var input = 'foo bar "lorem ipsum" baz';
var R =  /(\w|\s)*\w(?=")|\w+/g;
var output = input.match(R);

output is ["foo", "bar", "lorem ipsum", "baz"]

Note there are no extra double quotes around lorem ipsum

Although it assumes the input has the double quotes in the right place:

var input2 = 'foo bar lorem ipsum" baz'; var output2 = input2.match(R);
var input3 = 'foo bar "lorem ipsum baz'; var output3 = input3.match(R);

output2 is ["foo bar lorem ipsum", "baz"]
output3 is ["foo", "bar", "lorem", "ipsum", "baz"]

And won't handle escaped double quotes (is that a problem?):

var input4 = 'foo b\"ar  bar\" \"bar "lorem ipsum" baz';
var output4 = input4.match(R);

output4 is ["foo b", "ar  bar", "bar", "lorem ipsum", "baz"]
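If escaped quotes inside a phrase do need to be supported, one possible approach is a different pattern that treats a backslash plus the following character as a single unit. This is not the regex from this answer, just a sketch. It assumes backslash is the escape character and that the input string really contains the backslashes (inside a JavaScript string literal, \" on its own is just a plain quote). The name tokenizeEscaped is hypothetical.

```javascript
// Hypothetical helper, not taken from the answers in this thread.
function tokenizeEscaped(input) {
    // Group 1: body of a quoted phrase, where a backslash plus any character counts as one unit.
    // Group 2: a bare term containing no whitespace or quotes.
    var re = /"((?:[^"\\]|\\.)*)"|([^\s"]+)/g;
    var terms = [];
    var m;
    while ((m = re.exec(input)) !== null) {
        if (m[1] !== undefined) {
            // Drop the backslash from each escape sequence.
            terms.push(m[1].replace(/\\(.)/g, "$1"));
        } else {
            terms.push(m[2]);
        }
    }
    return terms;
}

// The doubled backslashes put real backslashes into the string.
var escaped = tokenizeEscaped('say "hello \\"world\\"" now');
// escaped is ['say', 'hello "world"', 'now']
```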
Sam Hasler
A: 

If you are just wondering how to build the regex yourself, you might want to check out Expresso. It's a great tool for learning how to build regular expressions, so you know what the syntax means. Once you've built your own expression, you can perform a .match with it.

+1  A: 

Thanks a lot for the quick responses!

Here's a summary of the options, for posterity:

var input = 'foo bar "lorem ipsum" baz';

output = input.match(/("[^"]+"|[^"\s]+)/g);
output = input.match(/"[^"]*"|\w+/g);
output = input.match(/("[^"]*")|([^\s"]+)/g);
output = input.match(/(".+?"|\w+)/g);
re = /"(.+?)"|(\w+)/g; m = re.exec(input); // one match per call, so call it in a loop

For the record, here's the abomination I had come up with:

var input = 'foo bar "lorem ipsum" "dolor sit amet" baz';
var terms = input.split(" ");

var items = [];
var buffer = [];
for (var i = 0; i < terms.length; i++) {
    if (terms[i].indexOf('"') !== -1) { // outer phrase fragment -- N.B.: assumes quote is either first or last character
        if (buffer.length === 0) { // beginning of phrase
            //console.log("start:", terms[i]);
            buffer.push(terms[i].substr(1));
        } else { // end of phrase
            //console.log("end:", terms[i]);
            buffer.push(terms[i].substr(0, terms[i].length - 1));
            items.push(buffer.join(" "));
            buffer = [];
        }
    } else if (buffer.length !== 0) { // inner phrase fragment
        //console.log("cont'd:", terms[i]);
        buffer.push(terms[i]);
    } else { // individual term
        //console.log("standalone:", terms[i]);
        items.push(terms[i]);
    }
    //console.log(items, "\n", buffer);
}
items = items.concat(buffer);

//console.log(items);