I have a sentence, and I want to remove some words from it.

So if I have:

"jQuery is a Unique language"

and an array that is named garbageStrings:

var garbageStrings = ['of', 'the', "in", "on", "at", "to", "a", "is"];

I want to remove the "is" and "a" in the sentence.

But if I build the regex like this (the statement is inside a for loop that goes through garbageStrings and looks for each entry in the sentence):

var regexp = new RegExp(garbageStrings[i]);

the string will become "jQuery Unique lnguge"

Notice that the "a" in language is removed from the sentence.

I didn't intend that to happen.
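
For illustration, the effect described above can be reproduced in a few lines. This is only a sketch: the helper names removeAnywhere and removeWholeWord are hypothetical, and the global "g" flag is an assumption, since the question does not show how replace is called.

```javascript
// Hypothetical helpers, not part of the original question.
function removeAnywhere(sentence, word) {
    // No anchors: the letters match wherever they occur, even inside other words.
    // The "g" flag is an assumption so that every occurrence is removed.
    return sentence.replace(new RegExp(word, "g"), "");
}

function removeWholeWord(sentence, word) {
    // \b is a word boundary; inside a string literal it must be written "\\b".
    return sentence.replace(new RegExp("\\b" + word + "\\b", "g"), "");
}

console.log(removeAnywhere("Unique language", "a"));  // "Unique lnguge"
console.log(removeWholeWord("Unique language", "a")); // "Unique language"
```

The second helper leaves "language" alone because there is no word boundary between "l" and "a".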

+3  A: 

JavaScript does have \b (word boundary), but in a string passed to new RegExp the backslash has to be doubled ("\\b"). If you would rather match on spaces explicitly, try this instead:

var regex  = new RegExp("( |^)" + "a" + "( |$)", "g");
var string = "I saw a big cat, it had a tail.";

// Keep only the leading delimiter so a removed word does not leave a double space.
string = string.replace(regex, "$1");
Chas. Owens
I tried this but it didn't work.
Keira Nighly
+3  A: 

First, if you are going to have to loop through each possible type of "garbageString", it is completely unnecessary to use Regex.

Secondly, you should probably be trying to search for "whole words only". This would mean that you match a garbage string only if it is preceded and followed by a word delimiter (such as whitespace in your example). If you implement this, a Regex based match becomes useful.

This code does not work if there are any punctuation marks, but it shouldn't be too hard to adapt it to your needs.

var text = "jQuery is a Unique language";
var garbageStrings = {"of": true,
                      "the": true,
                      "in": true,
                      "on": true,
                      "at": true,
                      "to": true,
                      "a": true,
                      "is": true};

var words = text.split(" ");
var newWords = [];
for (var i = 0; i < words.length; i++) {
    // Check own properties only, so a word like "constructor" is not mistaken for garbage.
    if (!Object.prototype.hasOwnProperty.call(garbageStrings, words[i])) {
        newWords.push(words[i]);
    }
}
text = newWords.join(" ");
Cerebrus
@gs: Thanks for the edit! :-)
Cerebrus
+8  A: 

Something like this:

function keyword(s) {
    var words = ['of', 'the', 'in', 'on', 'at', 'to', 'a', 'is'];
    var re = new RegExp('\\b(' + words.join('|') + ')\\b', 'g');
    // The "g" flag squeezes every run of spaces, not just the first one.
    return (s || '').replace(re, '').replace(/[ ]{2,}/g, ' ');
}
wombleton
+1, but I'd put all those words into an array for readability, and then use .join('|') on it to put it into the regex.
nickf
Sure. Also refined the squeeze regex.
wombleton
Note that not only spaces are word boundaries but any character in the `\W` class. So hyphens too.
Gumbo
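
To make the point about `\W` concrete, here is a small sketch comparing the `\b` approach with a whitespace-only split. The function names stripWithBoundary and stripBySpaces are hypothetical, and the split-based variant is just one possible alternative, not something proposed in the answer above.

```javascript
// Stopword list taken from the question; helper names are hypothetical.
var stopwords = ['of', 'the', 'in', 'on', 'at', 'to', 'a', 'is'];

function stripWithBoundary(s) {
    // \b treats any non-word character (\W) as a boundary, including "-".
    var re = new RegExp('\\b(?:' + stopwords.join('|') + ')\\b', 'g');
    return s.replace(re, '');
}

function stripBySpaces(s) {
    // Only whitespace separates words here, so "state-of-the-art" stays one word.
    return s.split(/\s+/).filter(function (w) {
        return stopwords.indexOf(w.toLowerCase()) === -1;
    }).join(' ');
}

console.log(stripWithBoundary('state-of-the-art'));             // "state---art"
console.log(stripBySpaces('state-of-the-art is a phrase'));     // "state-of-the-art phrase"
```

Which behaviour is right depends on whether hyphenated compounds should count as single words in your text.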
A: 

Firstly, you need to use arrays for this, not regex, because they will be faster. Regex is orders of magnitude more complex, and thus too heavy. As Atwood says, a programmer thinks he can solve a problem with a regex. Then he has two problems.

So here is a quick implementation that uses your list of garbage strings and does the job. It exploits JavaScript's fast built-in dictionary (object) lookups to check whether a word is garbage, and it handles punctuation. There's a little test page you can try it out on.

function splitwords(str) {
  var unpunctuated = unpunctuate(str);
  var splitted = unpunctuated.split(" ");
  return splitted;
}

function unpunctuate(str) {
  var punctuation = ['.', ',', ';', ':', '-'];
  var unpunctuated = str;
  for (var i = 0; i < punctuation.length; i++) {
    var punct = punctuation[i];
    // split/join removes every occurrence; replace() with a string only removes the first.
    // to keep the punctuation as separate tokens, swap in the line below.
    //unpunctuated = unpunctuated.split(punct).join(" "+punct+" ");
    unpunctuated = unpunctuated.split(punct).join("");
  }
  return unpunctuated;
}


var garbageStrings = ['of', 'the', "in", "on", "at", "to", "a", "is"];

var garbagedict= {};

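// index loop instead of for-in, which is meant for object keys and leaked a global here
for (var i = 0; i < garbageStrings.length; i++) {
  garbagedict[garbageStrings[i]] = 1;
}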

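function remove(str) {
  var words = splitwords(str);
  var keeps = [];
  for (var i = 0; i < words.length; i++) {
    var word = words[i];
    // own-property check, so inherited keys such as "constructor" are not treated as garbage
    if (!Object.prototype.hasOwnProperty.call(garbagedict, word)) {
      keeps.push(word);
    }
  }
  return keeps.join(" ");
}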
Phil H
Atwood didn't come up with that quote, not even close. http://en.wikipedia.org/wiki/Jamie_Zawinski
Paolo Bergantino
Atwood loves Regex, wtf? I'd delete that code in a second if I saw it in source.
Chad Grant
@Chad: So provide something better. @Paulo: I heard it from Atwood, and it's not a quote. The point of this code is that it does what the OP wants to do. Regex is great for pattern matching, but this isn't pattern matching. It's simple word comparison. Simple is definitely better here.
Phil H
The most voted up answer is what I would do, it's elegant, readable and easy to maintain. I don't see anything simple about 40 lines of code compared to 3. There is no reason to avoid Regex's at all costs. Regex has a \b for a reason. -1 for Quoting someone to give yourself credibility. +1 for showing a different way albeit messy.
Chad Grant
Here's a 3rd option. (for fun only : 2 lines) var str = "jQuery is a Unique language"; [str = str.replace(new RegExp('\\b'+ i +'\\b','gi'),'').replace(/\s{2,}/gi,' ') for each (i in ["of","the","in","on","at","to","a","is"])];
Chad Grant
A: 

Please, don't use RegExp for this, it's dirty and unnecessary, and takes up too many cycles. Easier:

var garbageStrings = ['of', 'the', "in", "on", "at", "to", "a", "is"];
for(var i=0; i < garbageStrings.length; i++){
    // replace every " word " with a single space and keep the result
    string = string.split(" "+garbageStrings[i]+" ").join(" ");
}

or using arrays:

var garbageStrings = ['of', 'the', "in", "on", "at", "to", "a", "is"];
var str = str.split(" ");
for(var i=0; i < garbageStrings.length; i++){
    // walk backwards so splice() does not skip the element after a removed one
    for(var j=str.length-1; j >= 0; j--){
        if(str[j].toLowerCase() === garbageStrings[i]){
            str.splice(j, 1);
        }
    }
}
str = str.join(" ");
Dmitri Farkov
A: 

Like wombleton said. ;)

Except I would remove whitespace as part of the regex itself, rather than use a second regex for this (for better performance):

var re = new RegExp("\\b(?:"+ words.join("|") + ")\\b\\s*", "gi");
s = s.replace(re, ""); // replace() returns a new string, so keep the result

The regex will be compiled on object creation. On repeated operations it shouldn't be noticeably slower than looping through each stopword with a string/array operation, and it's much easier to grasp.
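
As a rough sketch of the compile-once idea, the regex can be built a single time and reused through a closure. The name makeStopwordRemover is hypothetical, and the escaping step is an addition of this sketch (assumed to be useful if a stopword ever contains regex metacharacters), not part of the answer.

```javascript
// Hypothetical factory: builds the regex once and returns a reusable function.
function makeStopwordRemover(words) {
    // Assumption: escape regex metacharacters so each entry is matched literally.
    var escaped = words.map(function (w) {
        return w.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    });
    var re = new RegExp('\\b(?:' + escaped.join('|') + ')\\b\\s*', 'gi');
    return function (s) {
        // String.prototype.replace resets lastIndex for global regexes,
        // so the same RegExp object can be reused safely.
        return s.replace(re, '');
    };
}

var removeStopwords = makeStopwordRemover(['of', 'the', 'in', 'on', 'at', 'to', 'a', 'is']);
console.log(removeStopwords("jQuery is a Unique language")); // "jQuery Unique language"
console.log(removeStopwords("The cat is on the mat"));       // "cat mat"
```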

If you just have a short, static list of stopwords, you could instead write your own optimized regex:

var re = new RegExp("\\b(?:at?|i[ns]|o[fn]|t(?:he|o))\\b\\s*", "gi");
"jQuery is a Unique language".replace(re, "");

The idea here is that words sharing the same prefix (e.g. "of" and "on") share the same execution path up until the point where they differ. Hardly necessary in your case, but nice to know about.
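
One way to gain confidence in a hand-optimized pattern is to check it against the generated one. The sketch below is an assumption-laden test harness, not part of the answer: it anchors both patterns to the whole string (instead of using \b) and tries a few hand-picked candidates; the names generated, optimized and sameVerdict are hypothetical.

```javascript
var words = ['of', 'the', 'in', 'on', 'at', 'to', 'a', 'is'];

// Pattern built from the list, anchored so it must match the whole candidate.
var generated = new RegExp("^(?:" + words.join("|") + ")$", "i");
// Hand-optimized pattern from the answer, anchored the same way.
var optimized = /^(?:at?|i[ns]|o[fn]|t(?:he|o))$/i;

function sameVerdict(candidate) {
    return generated.test(candidate) === optimized.test(candidate);
}

// All stopwords plus some near misses that neither pattern should accept.
var candidates = words.concat(['an', 'it', 'or', 'th', 'too', 'language', '']);
console.log(candidates.every(sameVerdict)); // true
```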

kimsnarf