I've got four text files; two of them contain a keyword that the other two don't.

What's the fastest way/algorithm to find this "keyword" that is shared by the first two text files but doesn't exist in the other two?

I can think of naive approaches, such as going word by word and then searching with IndexOf, but that sounds like it's going to be really slow, especially if the number of files increases.

Extra 1 : A keyword can be a single word such as "apple" or a sentence such as "Have you seen the apple tree?". As long as the other two text files don't include it, any keyword works, but I assume that, performance-wise, shorter would be better.

Extra 2 : These text files are actually simple HTML sources, so they're expected to be big.

+4  A: 

If you just have one keyword (or key phrase), then you're likely best off just using IndexOf() or a similar simple existing function call. Your bottleneck won't be CPU or even memory bandwidth; it's disk speed. Your CPU can search 10X faster than the disk can feed it.

If the files are already in memory and you need to scan them quickly, the right algorithm is likely Boyer-Moore or Knuth-Morris-Pratt (KMP). But don't even bother with those at first; try simple IndexOf()-style primitives and see whether that's really too slow for you or not. Computers are FAST and you will likely be surprised.
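For illustration, a minimal sketch of that IndexOf()-style approach in C#, assuming the file names and the candidate keyword are placeholders you'd swap for your own:

    using System;
    using System.IO;

    class NaiveSearch
    {
        static void Main()
        {
            // Hypothetical inputs: adjust the paths and candidate to your setup.
            string[] files = { "a.html", "b.html", "c.html", "d.html" };
            string candidate = "Have you seen the apple tree?";

            foreach (string path in files)
            {
                // ReadAllText pulls the whole file into memory; disk I/O,
                // not the search itself, dominates the run time.
                bool found = File.ReadAllText(path)
                    .IndexOf(candidate, StringComparison.Ordinal) >= 0;
                Console.WriteLine("{0}: {1}", path, found ? "contains" : "missing");
            }
        }
    }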

SPWorley
+1  A: 

First, extract all the keywords from each file. (This is pretty boilerplate, I guess.)

Now, create a set or hash set of the keywords for each file (basically, it lets you check very quickly whether a string is part of a collection; Google for code/details, they're available in virtually every language).

After this is done, all you have to do is iterate through every keyword and check that it is present in the first two files' sets but in neither of the other two. Since you're using a hash set, each lookup takes only a few operations, so overall this should be quite fast.
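A rough sketch of this idea in C#, assuming whitespace splitting is an acceptable way to extract single-word keywords (real HTML sources would need tag stripping first) and that the file names are hypothetical:

    using System;
    using System.Collections.Generic;
    using System.IO;

    class SetApproach
    {
        static void Main()
        {
            // Hypothetical file names; the first two are expected to share the keyword.
            string[] files = { "a.html", "b.html", "c.html", "d.html" };

            // Build one set of words per file; HashSet lookups are O(1) on average.
            var sets = new List<HashSet<string>>();
            foreach (string path in files)
            {
                string[] words = File.ReadAllText(path).Split(
                    new[] { ' ', '\t', '\r', '\n' },
                    StringSplitOptions.RemoveEmptyEntries);
                sets.Add(new HashSet<string>(words));
            }

            // A keyword qualifies if it is in files 0 and 1 but in neither 2 nor 3.
            foreach (string word in sets[0])
            {
                if (sets[1].Contains(word) &&
                    !sets[2].Contains(word) &&
                    !sets[3].Contains(word))
                {
                    Console.WriteLine(word);
                }
            }
        }
    }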

v3
Yeah, this is basically what I've suggested. See my answer for a few specific details if you're curious.
Noldorin
+2  A: 

This would seem to be the sort of thing for which a hashtable would be perfect. Storage and retrieval of hashtable entries is possible in O(1) average time, and one can be used quite effectively here. I would recommend trying something like the following algorithm:

  1. Create a Dictionary<string, int> (this is effectively a generic hashtable, available from .NET 2.0 onwards). This will be used to keep track of the occurrences of each keyword (the value will act as a bit field).
  2. Load each text file and read all the keywords, setting the appropriate bit for the corresponding text file in which the keyword is found. Example:

    dict[keyword] |= (1 << curTextFileIndex);
    

    where curTextFileIndex would vary from 0 to 3 in your case.

  3. Iterate over all entries in the dictionary, looking for the appropriate value (bit field). In your case, because you are looking for a keyword that appears in the first two files but not the last two, the value you want to search for is binary 0011 (or 3 in decimal). Find this entry and you have your keyword (see the sketch after this list).
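To make the steps concrete, a minimal sketch of the whole procedure, assuming single-word keywords split on whitespace and hypothetical file names:

    using System;
    using System.Collections.Generic;
    using System.IO;

    class BitFieldSearch
    {
        static void Main()
        {
            // Hypothetical file names; their order determines the bit positions.
            string[] files = { "a.html", "b.html", "c.html", "d.html" };
            var dict = new Dictionary<string, int>();

            for (int i = 0; i < files.Length; i++)
            {
                string[] words = File.ReadAllText(files[i]).Split(
                    new[] { ' ', '\t', '\r', '\n' },
                    StringSplitOptions.RemoveEmptyEntries);
                foreach (string word in words)
                {
                    int bits;
                    dict.TryGetValue(word, out bits); // bits stays 0 if word is new
                    dict[word] = bits | (1 << i);     // mark file i as containing word
                }
            }

            // Binary 0011 == 3: present in files 0 and 1, absent from 2 and 3.
            foreach (KeyValuePair<string, int> entry in dict)
            {
                if (entry.Value == 3)
                    Console.WriteLine(entry.Key);
            }
        }
    }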

Unless I'm mistaken, this algorithm runs in O(n) time, where n is the total number of keywords in all your text files. I don't think you're going to get better than that, truthfully.

Hope that helps. Let me know if you need a few more details...

Edit: Hrmm... I seem to have missed the bit about your "keywords" possibly containing more than one actual word. If these "keywords" are known to be shorter than a certain (lowish) number of words, then I think this solution may still be viable with small modifications (for example, inserting every run of up to n consecutive words into the dictionary rather than single words). Otherwise, you'll need something a bit more clever, it would appear.

Noldorin
Thanks a lot for the detailed description; I also liked the bit trick. It works and is reasonably fast.
dr. evil
Ah, glad it works well. :)
Noldorin