views:

354

answers:

6

I am hoping I am wording this correctly to get across what I am looking for.

I need to compare two pieces of text. If the two strings are alike I would like to get scores that are very alike if the strings are very different i need scores that are very different.

If i take a md5 hash of an email and change one character the hash changes dramatically I want something to not change too much. I need to compare how alike two pieces of content are without storing the string.

Update: I am looking now at combining some ideas from the various links people have provided. Ideally I would of liked a single input function to create my score so I am looking at using a reference string to always compare my input to. I am also looking at taking asci characters and suming these up. Still reading all the links provided.

+7  A: 

What you're looking for is a LCS algorithm (see also Levenshtein distance). You may also try Soundex or some other phonetic algorithm.

Anton Gogolev
+1  A: 

Check their Levenshtein Distance

In PHP you even have the levenshtein() function that makes exactly that.

Seb
+1  A: 

I need to compare two pieces of text. If the two strings are alike I would like to get scores that are very alike if the strings are very different i need scores that are very different.

It really depends on what you mean by "same" or "different". For example, if someone replaces "United States of America" with "USA" in your string, is that mostly the same string (because USA is just an abbreviation for something longer), or is it very different (because a lot of characters changed)?

You essentially need to either devise a function that describes how to compute "sameness" or use a pre-existing definition thereof. For example, the aforementioned Levenshtein distance measures total difference based on the number of changes you have to make to get to the original string.

John Feminella
Thanks John for my purposes USA and United States Of America would be differenent.
Paul Whelan
+1  A: 

Since the Levenshtein distance needs both input strings to produce a value, you would have to store all strings.

You could, however, use a small number of strings as markers and only store these as strings.

You would then calculate the Levenshtein distance from a new string to each of these marker strings and store these values. You could then guess that two strings that have a similar Levenshtein distance to all markers are also similar to each other. It would likely be sensible to "engineer" these markers in such a way that their mutual Levenshtein distance is as large as possible. I don't know whether there has been some research in this direction.

Svante
+1  A: 

Many people have suggested looking at distance/metric like approaches, and I think the wording of the question leads that way. (By the way, a hash like md5 is trying to do pretty much the opposite thing that a metric does, so it's hardly surprising that this wouldn't work for you. There are similar ideas that don't change much under small deltas, but I suspect they don't encode enough information for what you want to do)

Particularly given your update in the comments though, I think this type of approach is not very helpful.

What you are looking for is more of a clustering problem, where you want to generate a signature (i.e. feature vector) from each email and later compare it to new inputs. So essentially what you have is a machine learning problem. Deciding what "close" means may be a bit of a challenge. To get started though, assuming it actually is emails you're looking at you may do well to look at the sorts of feature generation done by many spam-filters, this will give you (probably Euclidean, at least to start) a space to measure distances in based on a signature (feature vector).

Without knowing more about your problem it's hard to be more specific.

simon
+4  A: 

Reading your comments, it sounds like you are actually trying to compare entire documents, each containing many words.

This is done successfully in information retrieval systems by treating documents as N-dimensional points in space. Each word in the language is an axis. The distance along the axis is determined by the number of times that word appears in the document. Similar documents are then "near" each other in space.

This way, the whole document doesn't need to be stored, just its word counts. And usually the most common words in the language are not counted at all.

erickson
Thanks erickson very interesting reading
Paul Whelan