I have a site which is searchable using Lucene. I've noticed from logs that users sometimes don't find what they're looking for because they enter a singular term, but only the plural version of that term is used on the site. I would like the search to find uses of other forms of a word as well. This is a problem that I'm sure has been solved many times over, so what are the best practices for this?
Please note: this site only has English content.
Some approaches I've thought of:
- Look up the word in some kind of thesaurus file to determine alternate forms of a given word.
  - Some examples:
    - Searches for "car", also add "cars" to the query.
    - Searches for "carry", also add "carries" and "carried" to the query.
    - Searches for "small", also add "smaller" and "smallest" to the query.
    - Searches for "can", also add "can't", "cannot", "cans", and "canned" to the query.
    - And it should work in reverse (i.e. search for "carries" should add "carry" and "carried").
  - Drawbacks:
    - Doesn't work for many new technical words unless the dictionary/thesaurus is updated frequently.
    - I'm not sure about the performance of searching the thesaurus file.
- Generate the alternate forms algorithmically, based on some heuristics.
  - Some examples:
    - If the word ends in "s" or "es" or "ed" or "er" or "est", drop the suffix.
    - If the word ends in "ies" or "ied" or "ier" or "iest", convert to "y".
    - If the word ends in "y", convert to "ies", "ied", "ier", and "iest".
    - Try adding "s", "es", "er" and "est" to the word.
  - Drawbacks:
    - Generates lots of non-words for most inputs.
    - Feels like a hack.
    - Looks like something you'd find on TheDailyWTF.com. :)
- Something much more sophisticated?
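
As a minimal sketch of what the first approach could look like, assuming the word forms are available as groups of related words: every form is mapped to its whole group, so the lookup works in both directions and costs one in-memory dictionary access per term. `FORM_GROUPS`, `build_index` and `expand_term` are hypothetical names, and the hard-coded groups stand in for whatever file would actually supply the data.

```python
from collections import defaultdict

# Hypothetical word-form groups; in practice these would be loaded from a file.
FORM_GROUPS = [
    ["car", "cars"],
    ["carry", "carries", "carried"],
    ["small", "smaller", "smallest"],
]


def build_index(groups):
    """Map every form to the full set of forms in its group."""
    index = defaultdict(set)
    for group in groups:
        forms = set(group)
        for form in forms:
            index[form] |= forms
    return index


def expand_term(term, index):
    """Return the term plus any known alternate forms, sorted for stable output."""
    key = term.lower()
    # Unknown words fall back to just the term itself.
    return sorted(index.get(key, {key}))


INDEX = build_index(FORM_GROUPS)
print(expand_term("carries", INDEX))  # ['carried', 'carries', 'carry']
print(expand_term("widget", INDEX))   # ['widget']
```

The second call shows the first drawback listed above: a word that is missing from the file is simply not expanded.
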
I'm thinking of doing some kind of combination of the first two approaches, but I'm not sure where to find a thesaurus file (or what it's called, as "thesaurus" isn't quite right, but neither is "dictionary").
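
A rough sketch of such a combination, under the assumption that some word list is available to check candidates against: the suffix heuristics generate candidate forms, and only candidates found in the list are kept. `candidate_forms`, `expand_with_vocabulary` and `VOCABULARY` are hypothetical names; the vocabulary here is a tiny hard-coded stand-in (it could, for instance, be the set of words that actually occur on the site).

```python
def candidate_forms(word):
    """Generate possible alternate forms using the suffix heuristics listed above."""
    word = word.lower()
    candidates = {word}
    # "ies" / "ied" / "ier" / "iest" -> "y"
    for suffix in ("iest", "ies", "ied", "ier"):
        if word.endswith(suffix):
            candidates.add(word[: -len(suffix)] + "y")
    # Drop a plain suffix, leaving at least two characters.
    for suffix in ("est", "es", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            candidates.add(word[: -len(suffix)])
    # Trailing "y" -> "ies", "ied", "ier", "iest"
    if word.endswith("y"):
        for suffix in ("ies", "ied", "ier", "iest"):
            candidates.add(word[:-1] + suffix)
    # Blindly append common suffixes.
    for suffix in ("s", "es", "er", "est"):
        candidates.add(word + suffix)
    return candidates


def expand_with_vocabulary(word, vocabulary):
    """Keep the original term plus generated forms that occur in the vocabulary."""
    original = word.lower()
    return sorted(
        c for c in candidate_forms(word) if c == original or c in vocabulary
    )


# Hypothetical vocabulary; a real one would be much larger.
VOCABULARY = {"car", "cars", "carry", "carries", "carried",
              "small", "smaller", "smallest"}

print(sorted(candidate_forms("car")))               # includes non-words like 'carest'
print(expand_with_vocabulary("car", VOCABULARY))    # ['car', 'cars']
print(expand_with_vocabulary("carry", VOCABULARY))  # ['carried', 'carries', 'carry']
```

Note that a single pass of these rules does not get from "carries" to "carried"; that would need the generated forms to be expanded again, or the word to be reduced to a base form first.
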