views:

120

answers:

2

I have various length strings which are full of Base64 chars. Actualy they are audio recognition datas differs by song-to-song.

For easily comparing parts of those strings i divide them into 16-char sub-strings. (which is about 1 second of a song) But in some cases, i just can't compare these ones head to head.. i should be measuring them.

For example comparison with 'hellohellohelloo' and 'hallohellohelloo' should get a closer value then 'hellohellohelloo' and 'herehellohelloo' comparison.

Is there any algorithm or theorical


Edit: Sorry, i am new here :) And i couldn't make myself clear. Here are some comments that will make me clear and proposes an idea.

Comment 1:

Actually i know about Levenshtein distance, but the problem is every time i compare two strings i have to build comparison matrix and that makes searching process slow. If i can convert for example hello to 4444 and hallo to 4443 i can determine how close records i have for 'hello' by just indexing numerical values.

Comment 2:

Maybe i should determine a base constant-length string(s) and store distance values from them as the index values for string. It's just an idea?!

A: 

Levenshtein's distance will probably help you : http://en.wikipedia.org/wiki/Levenshtein_distance

It's usually pretty fast, and there are implementations in most modern languages too.

Nicolas
Actually i know about Levenshtein distance, but the problem is every time i compare two strings i have to build comparison matrix and that makes searching process slow. If i can convert for example hello to 4444 and hallo to 4443 i can determine how close records i have for 'hello' by just indexing numerical values. I think i am a little bit more clear now. :)
Marcus Frex
Maybe i should determine a base constant-length string(s) and store distance values from them as the index values for string. It's just an idea?!
Marcus Frex
I'm not an expert in that field, but I'm pretty sure there are other algorythms similar to Levenshtein's, possibly closer to what you're looking for.The solution to your problem could also be in the combination of two algorythms...why not work with Levenshtein's, and add an analysis of the char-by-char distance (this method has a name but it doesn't come to mind right now ><).For instance : "hello" vs "hally" would give something like :h - h = 0e - a = 4 (b, c, d, e)l - l = 0l - l = 0o - y = 10 (p, q, r, s, t, u, v, w, x, y)See whan I mean ?
Nicolas
StackOverflow should really have comments formatting...Sorry for the wall-of-text answer :/
Nicolas
Thanks for the idea! I think i should do something like that.
Marcus Frex
A: 

The Levenshtein distance might work for you. Also see the Wikipedia overview of edit distance.

Patrice