ansaurus

Question

RegExp: I want to remove unnecessary words in the Sentence. How can I do it?

Answer 1

+3 A:

JavaScript does have \b (word boundary), but inside a string passed to new RegExp it has to be written as "\\b". If you only want to treat spaces as delimiters, try this instead:

var regex  = new RegExp("( |^)" + "a" + "( |$)", "g");
var string = "I saw a big cat, it had a tail.";

// Put back only the leading delimiter, so that " a " collapses to a single space.
string = string.replace(regex, "$1");

Chas. Owens 2009-05-14 05:54:21

I tried this but it didn't work.

Keira Nighly 2009-05-14 11:33:40

Answer 2

+3 A:

First, if you are going to have to loop through each possible type of "garbageString", it is completely unnecessary to use Regex.

Secondly, you should probably be trying to search for "whole words only". This would mean that you match a garbage string only if it is preceded and followed by a word delimiter (such as whitespace in your example). If you implement this, a Regex based match becomes useful.

This code does not work if the text contains punctuation marks, but it shouldn't be too hard to adapt it to your needs.

var text = "jQuery is a Unique language";
var garbageStrings = {"of": true,
                      "the": true,
                      "in": true,
                      "on": true,
                      "at": true,
                      "to": true,
                      "a": true,
                      "is": true};

var words = text.split(" ");
var newWords = [];
for (var i = 0; i < words.length; i++) {
    // hasOwnProperty avoids false hits on inherited keys such as "constructor".
    if (!garbageStrings.hasOwnProperty(words[i])) {
        newWords.push(words[i]);
    }
}
text = newWords.join(" ");
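As a rough sketch of how the lookup above could be adapted to tolerate punctuation: the variant below compares a lowercased copy of each token with its leading and trailing punctuation stripped. The names `STOP_WORDS` and `removeStopWords` are made up for this illustration, and the approach assumes plain ASCII words.

```javascript
// Hypothetical helper, not part of the original answer.
var STOP_WORDS = { "of": true, "the": true, "in": true, "on": true,
                   "at": true, "to": true, "a": true, "is": true };

function removeStopWords(text) {
    var tokens = text.split(/\s+/);
    var kept = [];
    for (var i = 0; i < tokens.length; i++) {
        // Compare a lowercased copy with leading/trailing punctuation stripped.
        var bare = tokens[i].toLowerCase().replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, "");
        if (!Object.prototype.hasOwnProperty.call(STOP_WORDS, bare)) {
            kept.push(tokens[i]); // keep the original token, punctuation included
        }
    }
    return kept.join(" ");
}

var cleaned = removeStopWords("The cat sat on the mat, in a box.");
// cleaned is "cat sat mat, box."
```

One trade-off of this sketch: a stop word with punctuation attached (for example "is,") is dropped together with its punctuation.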

Cerebrus 2009-05-14 05:55:16

@gs: Thanks for the edit! :-)

Cerebrus 2009-05-15 07:02:45

Answer 3

+8 A:

Something like this:

function keyword(s) {
    var words = ['of', 'the', 'in', 'on', 'at', 'to', 'a', 'is'];
    var re = new RegExp('\\b(' + words.join('|') + ')\\b', 'g');
    return (s || '').replace(re, '').replace(/[ ]{2,}/g, ' ');
}

wombleton 2009-05-14 06:03:56

+1, but I'd put all those words into an array for readability, and then use .join('|') on it to put it into the regex.

nickf 2009-05-14 06:07:49

Sure. Also refined the squeeze regex.

wombleton 2009-05-15 04:37:06

Note that not only spaces are word boundaries but any character in the `\W` class. So hyphens too.

Gumbo 2009-05-16 08:32:30
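To illustrate the comment above, here is a small sketch of how `\b` behaves around hyphens, followed by one possible stricter pattern that treats only whitespace and the ends of the string as delimiters. The variable names and the stricter pattern are assumptions for this example, not something proposed in the thread.

```javascript
// \b sees a boundary at every switch between \w and \W characters,
// so hyphens delimit words just as spaces do.
var boundaryRe = /\b(?:of|the)\b/g;

var hyphenated = "state-of-the-art".replace(boundaryRe, "");
// hyphenated is "state---art": both stop words were removed from inside the compound

var embedded = "often bathe".replace(boundaryRe, "");
// embedded is unchanged: no boundary falls inside "often" or "bathe"

// Hypothetical stricter variant: only whitespace or the string edges delimit a word.
var spaceDelimitedRe = /(?:^|\s)(?:of|the)(?=\s|$)/g;

var strict = "a state-of-the-art view of the hill".replace(spaceDelimitedRe, "").trim();
// strict is "a state-of-the-art view hill"
```

Which behaviour is right depends on whether hyphenated compounds should count as single words in your text.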

Answer 4

A:

Firstly, you need to use arrays for this, not regex, because they will be faster. Regex is orders of magnitude more complex, and thus too heavy. As Atwood says, a programmer thinks he can solve a problem with a regex. Then he has two problems.

So here is a quick implementation that uses your list of garbage strings, handles punctuation, and exploits JavaScript's fast built-in object lookup to check whether a word is garbage. There's a little test page you can try it out on.

function splitwords(str) {
  var unpunctuated = unpunctuate(str);
  var splitted = unpunctuated.split(" ");
  return splitted;
}

function unpunctuate(str) {
  var punctuation = ['.', ',', ';', ':', '-'];
  var unpunctuated = str;
  for (var i = 0; i < punctuation.length; i++) {
    var punct = punctuation[i];
    // split/join removes every occurrence; replace() with a string only removes the first.
    // this line removes punctuation. to keep it, swap in the line below.
    //unpunctuated = unpunctuated.split(punct).join(" "+punct+" ");
    unpunctuated = unpunctuated.split(punct).join("");
  }
  return unpunctuated;
}


var garbageStrings = ['of', 'the', "in", "on", "at", "to", "a", "is"];

var garbagedict= {};

for (var i = 0; i < garbageStrings.length; i++) {
  garbagedict[garbageStrings[i]] = 1;
}

function remove(str) {
  var words = splitwords(str);
  var keeps = [];
  for (var i = 0; i < words.length; i++) {
    var word = words[i];
    // hasOwnProperty ignores inherited keys such as "constructor"
    if (!garbagedict.hasOwnProperty(word)) {
      keeps.push(word);
    }
  }
  return keeps.join(" ");
}

Phil H 2009-05-14 15:54:28

Atwood didn't come up with that quote, not even close. http://en.wikipedia.org/wiki/Jamie_Zawinski

Paolo Bergantino 2009-05-15 05:56:00

Atwood loves Regex, wtf? I'd delete that code in a second if I saw it in source.

Chad Grant 2009-05-15 07:29:07

@Chad: So provide something better. @Paolo: I heard it from Atwood, and it's not a quote. The point of this code is that it does what the OP wants to do. Regex is great for pattern matching, but this isn't pattern matching. It's simple word comparison. Simple is definitely better here.

Phil H 2009-05-15 09:39:12

The most-voted answer is what I would do: it's elegant, readable and easy to maintain. I don't see anything simple about 40 lines of code compared to 3. There is no reason to avoid regexes at all costs. Regex has a \b for a reason. -1 for quoting someone to give yourself credibility. +1 for showing a different way, albeit a messy one.

Chad Grant 2009-05-15 11:20:23

Here's a 3rd option (for fun only: 2 lines, and it relies on the Mozilla-only `for each` and array comprehension syntax, so it only runs in Firefox). var str = "jQuery is a Unique language"; [str = str.replace(new RegExp('\\b'+ i +'\\b','gi'),'').replace(/\s{2,}/gi,' ') for each (i in ["of","the","in","on","at","to","a","is"])];

Chad Grant 2009-05-15 11:23:34

Answer 5

A:

Please, don't use RegExp for this, it's dirty and unnecessary, and takes up too many cycles. Easier:

var garbageStrings = ['of', 'the', "in", "on", "at", "to", "a", "is"];
for(var i=0; i < garbageStrings.length; i++){
    // split/join replaces every occurrence and keeps one space between the neighbours
    string = string.split(" "+garbageStrings[i]+" ").join(" ");
}

or using arrays:

var garbageStrings = ['of', 'the', "in", "on", "at", "to", "a", "is"];
var words = str.split(" ");
for(var i=0; i < garbageStrings.length; i++){
    // walk backwards so that splice() does not skip the element after a removed one
    for(var j=words.length-1; j >= 0; j--){
        if(words[j].toLowerCase() === garbageStrings[i]){
            words.splice(j, 1);
        }
    }
}
str = words.join(" ");

Dmitri Farkov 2009-05-14 19:23:45

Answer 6

A:

Like wombleton said. ;)

Except I would remove whitespace as part of the regex itself, rather than use a second regex for this (for better performance):

var re = new RegExp("\\b(?:"+ words.join("|") + ")\\b\\s*", "gi");
s = s.replace(re, ""); // replace() returns a new string, so keep the result

The regex will be compiled on object creation. On repeated operations it shouldn't be noticeably slower than looping through each stopword with a string/array operation, and it's much easier to grasp.
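A minimal sketch of the "compile once, reuse many times" idea: the factory below builds the regex a single time and returns a function that applies it. The names `makeStopwordRemover` and `stripStopwords` are invented for this example, and the sketch assumes the words contain no regex metacharacters.

```javascript
// Hypothetical factory, not part of the original answer.
function makeStopwordRemover(words) {
    // Built once here, then reused by every call to the returned function.
    var re = new RegExp("\\b(?:" + words.join("|") + ")\\b\\s*", "gi");
    return function (s) {
        return s.replace(re, "");
    };
}

var stripStopwords = makeStopwordRemover(["of", "the", "in", "on", "at", "to", "a", "is"]);

var first = stripStopwords("jQuery is a Unique language");
// first is "jQuery Unique language"

var second = stripStopwords("The cat sat on the mat");
// second is "cat sat mat"
```

Because `replace` always starts scanning from the beginning of the string, reusing a regex with the "g" flag this way is safe across calls.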

If you just have a short, static list of stopwords, you could instead write your own optimized regex:

var re = new RegExp("\\b(?:at?|i[ns]|o[fn]|t(?:he|o))\\b\\s*", "gi");
"jQuery is a Unique language".replace(re, "");

The idea here is that words sharing the same prefix (e.g. "of" and "on") share the same execution path up until the point where they differ. Hardly necessary in your case, but nice to know about.
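If you hand-write a pattern like this, it is worth checking that it still accepts exactly the same words as the plain list. The sketch below compares anchored versions of both patterns over the stopwords plus a few near misses; the variable names and the choice of near misses are assumptions made for this example.

```javascript
var stopwords = ["of", "the", "in", "on", "at", "to", "a", "is"];

// Anchored versions of both patterns, so each test is a whole-word check.
var joined    = new RegExp("^(?:" + stopwords.join("|") + ")$", "i");
var optimized = /^(?:at?|i[ns]|o[fn]|t(?:he|o))$/i;

// Every stopword plus some strings that should be rejected by both patterns.
var candidates = stopwords.concat(["an", "it", "or", "th", "too", "", "jQuery"]);

var mismatches = [];
for (var i = 0; i < candidates.length; i++) {
    if (joined.test(candidates[i]) !== optimized.test(candidates[i])) {
        mismatches.push(candidates[i]);
    }
}
// mismatches is empty: both patterns agree on every candidate
```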

kimsnarf 2009-05-15 05:48:58
