Hi folks! I have a list of a few thousand terms. Many of them overlap but appear in different forms, for example (ruby, a_ruby), (triathlon, triathlete, triathletes), (nonprofit, non_profit, non_profits).

Most of these share a significant number of characters without being exactly the same string, for example nonprofit and non_profit.

What regex would work best for this? I know that I can use stemming as well, but I'm wondering how I can combine that with the regex.

+2  A: 

For a single list of a few thousand items, I'd consider an alternate approach.

Sort the list alphabetically, then remove the duplicates manually. Whatever regex and subsequent processing you end up with will probably take as much time as going through the list by hand, if not more.

Of course, I'm assuming this is a one-time proposition. I defer to regex experts for a programmatic solution.

Bob Kaufman
A: 

I agree with Bob Kaufman that you should do a first pass to eliminate actual duplicates. After that, you have a problem that regex cannot solve for you; you will need to look into measurements of edit distance to get anywhere with it.
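To make the edit-distance idea concrete, here is a minimal sketch of Levenshtein distance in Python, plus a helper that lists pairs of terms within a chosen distance. The names `edit_distance` and `close_pairs` and the `max_distance` threshold are illustrative assumptions, not part of any library. The pairwise scan is quadratic, which should be tolerable for a few thousand terms but is not tuned for larger lists.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def close_pairs(terms, max_distance=1):
    """Hypothetical helper: all pairs of distinct terms whose edit
    distance is at most max_distance (the threshold is an assumption)."""
    unique = sorted(set(terms))
    return [(a, b) for a, b in combinations(unique, 2)
            if edit_distance(a, b) <= max_distance]


print(edit_distance("nonprofit", "non_profit"))
print(close_pairs(["nonprofit", "non_profit", "ruby", "a_ruby"], max_distance=2))
```

A small threshold catches pairs like nonprofit / non_profit, but it will also flag unrelated short words that differ by one letter, so the output is best treated as a list of candidates to review.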

chaos
A: 

My usual strategy in this situation, which is not perfectly reliable, is as follows:


1) Remove all nonalphanumeric characters.
2) Make all strings lowercase.
3) Put all of the strings in a HashSet (this will remove duplicates).
4) Check for any cases where word and word+"s" are both in the set, and remove the plural one.
5) Output the strings in alphabetical order, and do a quick manual search for duplicates. If any are found, define new rules accordingly.

Other rules you may need:

  • Replace "&" with "and".
  • Remove all instances of "inc".
  • Replace all instances of "television" with "TV".
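Steps 1 to 4 above, plus the "&" rule, might look roughly like this in Python, where a built-in `set` plays the role of the HashSet. The function names `normalize` and `dedupe` are made up for illustration, and the "&" replacement runs before the cleanup so it is not stripped first. The "inc" and "television" rules are left out, since they depend on the data.

```python
import re


def normalize(term: str) -> str:
    """Lowercase, expand '&', and strip every non-alphanumeric character."""
    term = term.lower().replace("&", "and")
    # [\W_] matches anything that is not a letter or digit (underscore included)
    return re.sub(r"[\W_]+", "", term)


def dedupe(terms):
    """Normalize, drop exact duplicates, drop plurals whose singular is present."""
    unique = {normalize(t) for t in terms}  # the set removes exact duplicates
    unique.discard("")                      # terms made only of punctuation
    # Step 4: if both word and word + "s" exist, keep only the singular
    kept = {w for w in unique
            if not (w.endswith("s") and w[:-1] in unique)}
    return sorted(kept)                     # alphabetical, for manual review


terms = ["nonprofit", "non_profit", "non_profits",
         "Triathlete", "triathletes", "triathlon", "R&D"]
print(dedupe(terms))
```

Note that this only handles the simple word + "s" plural; pairs such as ruby / a_ruby or triathlon / triathlete still have to be caught in the manual pass of step 5 or by extra rules.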
Brian
That sounds like a really good sequence. I'll try this before I attempt any complex regex or algorithm.
ming yeow