views:

267

answers:

4

I have an array of strings, not many (maybe a few hundreds) but often long (a few hundred chars).

Those string are, generally, nonsense and different one from the other.. but in a group of those string, maybe 5 out of 300, there's a great similarity. In fact they are the same string, what differs is formatting, punctuation and a few words..

How can I work out that group of string?

By the way, I'm writing in ruby, but if nothing else an algorithm in pseudocode would be fine.

thanks

+5  A: 

There are many ways to compare strings for similarity.

Here is a site with various similarity metrics you can use.

Andy West
A: 

It might be overkill and possibly and not an exact-fit to what you want to achieve but you might be able to use 'Ferret' to help (the Ruby version of Lucene - full-text index/search API) to sort out of the punctuation and formatting - also if the sentences differ by common 'stop-words' (the, and , is ...) these can be filtered.

Your searches will then be assigned weights: which gives an idea of similiarity.

http://www.davebalmain.com/ http://www.amazon.co.uk/Ferret-David-Balmain/dp/0596519400/ref=sr_1_2?ie=UTF8&s=books&qid=1264751909&sr=8-2

monojohnny
problem is I haven't any search to start with! I want to work out which strings are similar.. I tried levenshtein with some math to better weigh results and it's working decently.. not great but ok.
luca
Ok, but you could probably leverage Ferret to do a 'Search for Similar' - Ferret itself would assign the weights.As I said before, using Ferret might be overkill here, but worth checking the docs in case it gives you some ideas in any case.
monojohnny
+2  A: 

You can use the Levenshtein algorithm for this. Here's an implementation in Ruby.

JRL
A: 

Assuming that you are not worried about misspellings or other errors in each word, you could do the following:

Build an inverted index, which is basically a hash keyed by word, pointing to a list of pointers to the strings which contain that word (how you handle duplicate occurrences is up to you). To determine strings that are similar to a given query string, lookup each query word in the index, and for each source string in the resulting lists, count how many times the source string appears in each list. The strings with the highest counts are your best candidates for similarity, because they contain the most words in common.

Then you can compute the edit distance between the two strings, or whatever other metric you want. This way you avoid the O(n^2) complexity of comparing each string with every other string.

RyanHennig