ansaurus

Question

Fastest way/algorithm to find one different keyword between two set of text files

Answer 1

+4 A:

If you just have one keyword (or key phrase) then you're likely best off just using indexOf() or similar simple existing function calls. Your bottleneck won't be CPU or even memory bandwidth, it's just disk speed. Your CPU can search 10X faster than the disk can feed it.

If you have files already in memory and you need to scan quickly, the right algorithm is likely Boyer Moore or KMP. But don't even bother with it at first, try simple indexOf() kind of primitives and see whether that's really too slow for you or not. Computers are FAST and you will likely be surprised.

SPWorley 2009-04-05 00:16:51

Answer 2

+1 A:

First, generate all the keywords in each file. (This is pretty boilerplate, I guess)

Now, create a set or hashset (basically, it allows you to check if a string is part of a collection very very quickly) of keywords for each file. (Google for code/details, they're in virtually every language)

After this is done, all you have to do is iterate through every possible keyword and check if it is present in exactly 2 of the files. Since you're using a hashset, each lookup will take only a few operations - so overall, this should be quite fast.

v3 2009-04-05 00:29:16

Yeah, this is basically what I've suggested. See my answer for a few specific details if you're curious.

Noldorin 2009-04-05 00:38:52

Answer 3

+2 A:

This would seem to be the sort of thing for which a hashtable would be perfect. Storage and retrieval of hashtable entries is possible in O(1) time, and can be used quite effectively here. I would recommend trying something like the following algorithm:

Create a Dictionary<string, int> (this is effectively a generic hashtable, available from .NET 2.0 onwards). This will be used to keep track of the occurrences of each keywords (the value will act as a bit field).
Load each text file and read all the keywords, setting the appropiate bit for the corresponding text file in which the keyword is found. Example:
```
dict[keyword] |= (1 << curTextFileIndex);
```
where curTextFileIndex would vary from 0 to 3 in your case.
Iterate over all entries in the dictionary, looking for the appropiate value (bit field). In your case, because you are looking for a keyword that appears in the first two files but not the last two, the value you want to search for is 0011 (or 3 in decimal). Find this entry and you have your keyword.

Unless I'm mistaken, this algorithm runs in O(n) time, where n is the total number of keywords in all your text files. I don't think you're going to get better than that, truthfully.

Hope that helps. Let me know if you need a few more details...

Edit: Hrmm... I seemed to have missed the bit about your "keywords" possibly containing more than one actual word. If these "keywords" are known to be shorted than a certain (lowish) number of words, then I think this solution may still be viable with small modifications. Otherwise, you'll need something a bit more clever, it would appear.

Noldorin 2009-04-05 00:37:40

Thanks a lot for detailed description also I liked the bit trick. It works and reasonable fast.

dr. evil 2009-04-05 09:18:58

Ah, glad it works well. :)

Noldorin 2009-04-05 10:58:43

ansaurus

tags:

views:

answers:

Fastest way/algorithm to find one different keyword between two set of text files

related questions