A parallel algorithm might look like this:
- Sort the list of terms alphabetically using a parallel sort algorithm
- Divide the sorted list of terms into chunks, such that all interesting terms end up in the same chunk. If I'm not mistaken, as long as every term in a chunk starts with the same character (i.e. you only split where the first character changes), you'll never have interesting matches across two chunks (matches and prefixes always share the same first character, right?)
- For each chunk, find terms that are prefixes and/or matches, and take appropriate action. This should be easy, since matching terms will be right next to each other (the big list is sorted, so each chunk is too). There's a rough sketch of all three steps right after this list.
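To make the shape of that concrete, here's a rough sketch in Java (my assumptions: the terms are plain non-empty Strings, "appropriate action" just means reporting the pair, and reportMatch is a made-up placeholder):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParallelPrefixScan {

    public static void main(String[] args) {
        String[] terms = {"apple", "app", "banana", "band", "band", "cat", "app"};

        // Step 1: parallel sort (Arrays.parallelSort has been in the JDK since Java 8).
        Arrays.parallelSort(terms);

        // Step 2: split into chunks wherever the first character changes, so no
        // match or prefix relationship can span two chunks. Not parallel -- see notes below.
        List<int[]> chunks = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= terms.length; i++) {
            if (i == terms.length || terms[i].charAt(0) != terms[start].charAt(0)) {
                chunks.add(new int[] {start, i});   // terms[start..i) share a first character
                start = i;
            }
        }

        // Step 3: scan each chunk in parallel. Within a chunk, duplicates and
        // prefixes sit right next to each other because the chunk is sorted, so
        // comparing each term with its predecessor is enough to flag it.
        chunks.parallelStream().forEach(chunk -> {
            for (int i = chunk[0] + 1; i < chunk[1]; i++) {
                // startsWith catches both exact duplicates and proper prefixes.
                if (terms[i].startsWith(terms[i - 1])) {
                    reportMatch(terms[i - 1], terms[i]);
                }
            }
        });
    }

    // Made-up placeholder for whatever "appropriate action" means in your case.
    private static void reportMatch(String shorter, String longer) {
        System.out.println(shorter + " -> " + longer);
    }
}
```

The parallelStream() call is just the laziest way to fan the chunks out over a thread pool; an ExecutorService with one task per chunk would do the same job.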
Some notes:
This requires a parallel sort algorithm. Apparently these exist, but I don't know much else about them, since I've never had to use one directly. Your Mileage May Vary.
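For what it's worth, if you're on Java 8 or newer there's one in the JDK: Arrays.parallelSort, a fork/join sort-merge (the sketch above already leans on it). Just to show the call, not a claim about how well it scales for your data:

```java
import java.util.Arrays;

public class ParallelSortDemo {
    public static void main(String[] args) {
        String[] terms = {"pear", "peach", "pea", "plum", "pea"};

        // Splits the array, sorts the pieces on the common fork/join pool, and
        // merges them; for small arrays it falls back to a sequential sort.
        Arrays.parallelSort(terms);

        System.out.println(Arrays.toString(terms)); // [pea, pea, peach, pear, plum]
    }
}
```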
The second step (splitting the workload into chunks) doesn't appear to be parallelizable itself. You could implement it with modified binary searches that find where the first character changes, so hopefully this part is cheap, but it might not be, and you probably won't know for sure until you measure.
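Here's roughly what I mean by the modified binary searches, assuming the array is already sorted and the terms are non-empty (findFirstCharBoundaries is just my name for it, nothing standard):

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkBoundaries {

    /**
     * Given a sorted array of non-empty terms, returns the start index of each
     * chunk (a maximal run of terms sharing the same first character), plus a
     * final entry equal to terms.length. Cost is O(k log n) for k distinct
     * first characters, which is why I'd expect it to be cheap -- but measure.
     */
    static List<Integer> findFirstCharBoundaries(String[] terms) {
        List<Integer> boundaries = new ArrayList<>();
        int i = 0;
        while (i < terms.length) {
            boundaries.add(i);
            char c = terms[i].charAt(0);

            // Binary search for the first index whose term starts with a
            // character strictly greater than c.
            int lo = i + 1, hi = terms.length;
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (terms[mid].charAt(0) <= c) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            i = lo;
        }
        boundaries.add(terms.length);
        return boundaries;
    }

    public static void main(String[] args) {
        String[] sorted = {"ant", "ape", "bat", "bee", "bee", "cow"};
        System.out.println(findFirstCharBoundaries(sorted)); // [0, 2, 5, 6]
    }
}
```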
If you end up with many chunks and one is by far the largest, your performance will suck: the biggest chunk bounds how much the other threads can help.
Have you considered keeping the algorithm single-threaded, but changing it so that the first step sorts the list?
Currently the algorithm described in the question is O(n^2), as it loops through the list once per element in the list. If the list is sorted, then duplicates can be found in one pass through the list (duplicates will be right next to each other, and a term that is a prefix of another sorts immediately before the first term it matches) -- including the sort, this is a total cost of O(n log n). For large data sets, this will be much, much faster. Hopefully it will be fast enough that you can avoid multiple threads, which would be a lot of work.
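A minimal single-threaded sketch of that, again assuming plain non-empty Strings, with printing standing in for whatever action you actually need:

```java
import java.util.Arrays;

public class SortedScan {
    public static void main(String[] args) {
        String[] terms = {"note", "no", "note", "nob", "noted"};

        // O(n log n) sort, then one O(n) pass comparing each term with its
        // predecessor. Duplicates are adjacent, and a term that is a prefix of
        // any other term is necessarily a prefix of its immediate successor,
        // so comparing neighbours is enough to flag every such term.
        Arrays.sort(terms);

        for (int i = 1; i < terms.length; i++) {
            String prev = terms[i - 1];
            String curr = terms[i];
            if (curr.equals(prev)) {
                System.out.println("duplicate: " + curr);
            } else if (curr.startsWith(prev)) {
                System.out.println("prefix: " + prev + " -> " + curr);
            }
        }
    }
}
```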