views:

19

answers:

1

I have an application that lets users publish unstructured keywords. Simultaneously, other users can publish items that must be matched to one or more specified keywords. There is no restriction on the keywords either set of users may use, so simply hoping for a collision is likely to mean very few matches, when the reality is users might have used different keywords for the same thing or they are close enough (eg, 'bicycles' and 'cycling', or 'meat' and 'food').

I need this to work on mobile devices (Android), so I'm happy to sacrifice matching accuracy for efficiency and a small footprint. I know about s-match but this relies on a backing dictionary of 15MB, so it isn't ideal.

What other ideas/approaches/frameworks might help with this?

+1  A: 

Your example of 'bicycles' and 'cycling' could be addressed by a take on the Levenshtein edit-distance algorithm since the two words are somewhat related. But your example of 'meat' and 'food' would indeed require a sizable backing dictionary, unless of course the concept set or target audience is limited to say, foodies.

Have you considered hosting the dictionary as a web service and accessing the data as needed? The drawback of course is that your app would only work while in network coverage.

Paul Sasik
thanks, yes, i could well end-up deploying a matching service on a server, but ideally i wouldnt have to. Is the levenshtein thing a bit like a hamming distance?
MalcomTucker
Looked at both on Wikipedia and it looks to me that there's little or no difference between the two when it comes to string matching. the algorithm and results seem about the same. I know that Levenshtein uses a 2D char array to calculate its result. Hamm dist only shows cubes and hypercubes as possible data structures, and that may be the difference.
Paul Sasik