views:

366

answers:

3

Hi I was wondering whether anyone could offer some advice on the fastest / most efficient way to compre two arrays of strings in javascript.

I am developing a kind of tag cloud type thing based on a users input - the input being in the form a written piece of text such as a blog article or the likes.

I therefore have an array that I keep of words to not include - is, a, the etc etc.

At the moment i am doing the following:

Remove all punctuation from the input string, tokenize it, compare each word to the exclude array and then remove any duplicates.

The comparisons are preformed by looping over each item in the exclude array for every word in the input text - this seems kind of brute force and is crashing internet explorer on arrays of more than a few hundred words.

i should also mention my exclude list has around 300 items.

Any help would really be appreciated.

Thanks

A: 

You could use a hashing function for strings (I don't know if JS has one but i'm sure uncle Google can help ;] ). Then you would calculate hashes for all the words in your exclude list and create an array af booleans indexed by those hashes. Then just iterate through the text and check the word hashes against that array.

pablochan
thanks for the reply, I will definately look into that but how much faster can this be since you are still essentially iterating over the same number of elements the same number of times arent you?
David
Nope. The algorithm that you presented has O(n*m*k) complexity where n is the exclude list size, m - text size and k is the average number of operations in string comparison. The method im proposing has O(n) complexity for the initial hashing and O(m) for every comparison
pablochan
+4  A: 

I'm not sure about the whole approach, but rather than building a huge array then iterating over it, why not put the "keys" into a map-"like" object for easier comparison?

e.g.

var excludes = {};//object
//set keys into the "map"
excludes['bad'] = true;
excludes['words'] = true;
excludes['exclude'] = true;
excludes['all'] = true;
excludes['these'] = true;

Then when you want to compare... just do

var wordsToTest = ['these','are','all','my','words','to','check','for'];
var checkWord;
for(var i=0;i<wordsToTest.length;i++){
  checkWord = wordsToTest[i];
  if(excludes[checkword]){
    //bad word, ignore...
  } else {
    //good word... do something with it
  }
}

allows these words through ['are','my','to','check','for']

scunliffe
+1 Yup, that's how I'd do it, too...
jdv
To prevent any chance of this going wrong because of augmentation of `Object.prototype` (for example, if a library has added an `each` method to `Object.prototype`, "each" will be considered a bad word in the example code), you could use jshashtable (http://www.timdown.co.uk/jshashtable/).
Tim Down
this makes sense. I have implemented it and ti works great in firefox but it still crashes ie as it did before.I wonder if ie is known for problems like this or if my code can be improved any.
David
Edit: I have just tested my code in chrome, opera, firefox and safari and it works super fast. In ie it fails miserably and i have to restart the browser :(
David
@Tim Down: That's a very good reason NOT to use frameworks that mash the default namespace until it's unrecognizable.
Kyle Butt
@David - do you have a url where we can see the whole the thing? There may be something else tripping up IE.
scunliffe
if its crashing the browser then do you think it would be better to use an ajax call and do all the calculation on the server and simply return a result arrya?
David
I would still be curious as to *what* is causing the crash. Especially if it works in Firefox, Chrome etc. just fine. Can you post the list of exclude words and a sample test array? Maybe there is something we overlooked.
scunliffe
+1  A: 

It would be worth a try to combine the words into a single regex, and then compare with that. The regex engine's optimizations might allow the search to skip forward through the search text a lot more efficiently than you could do by iterating yourself over separate strings.

Pointy