This would seem to be the sort of thing for which a hashtable would be perfect. Storage and retrieval of hashtable entries is possible in O(1) time, and can be used quite effectively here. I would recommend trying something like the following algorithm:
- Create a
Dictionary<string, int>
(this is effectively a generic hashtable, available from .NET 2.0 onwards). This will be used to keep track of the occurrences of each keywords (the value will act as a bit field).
Load each text file and read all the keywords, setting the appropiate bit for the corresponding text file in which the keyword is found.
Example:
dict[keyword] |= (1 << curTextFileIndex);
where curTextFileIndex
would vary from 0 to 3 in your case.
- Iterate over all entries in the dictionary, looking for the appropiate value (bit field). In your case, because you are looking for a keyword that appears in the first two files but not the last two, the value you want to search for is 0011 (or 3 in decimal). Find this entry and you have your keyword.
Unless I'm mistaken, this algorithm runs in O(n) time, where n is the total number of keywords in all your text files. I don't think you're going to get better than that, truthfully.
Hope that helps. Let me know if you need a few more details...
Edit: Hrmm... I seemed to have missed the bit about your "keywords" possibly containing more than one actual word. If these "keywords" are known to be shorted than a certain (lowish) number of words, then I think this solution may still be viable with small modifications. Otherwise, you'll need something a bit more clever, it would appear.