ansaurus

Question

Fuzzy text (sentences/titles) matching in C#

Answer 1

+3 A:

It sounds like what you want may be a longest common substring match. That is, in your example, two files like

trash..thash..song_name_mp3.mp3 and garbage..spotch..song_name_mp3.mp3

would end up looking the same.
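As a rough illustration of that idea, here is a minimal sketch of a longest common substring search using the usual dynamic-programming table. The class and method names are made up for this example, and it compares characters exactly (case-sensitive), which may or may not be what you want for file names.

```csharp
using System;

public static class SubstringSketch
{
    // Returns the longest run of characters that appears contiguously in both inputs.
    // Runs in O(a.Length * b.Length) time and O(b.Length) memory.
    public static string LongestCommonSubstring(string a, string b)
    {
        if (string.IsNullOrEmpty(a) || string.IsNullOrEmpty(b)) return string.Empty;

        // prev[j] and curr[j] hold the length of the common suffix of
        // a[0..i) and b[0..j) for the previous and the current row.
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        int bestLength = 0;
        int bestEndInA = 0;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                if (a[i - 1] == b[j - 1])
                {
                    curr[j] = prev[j - 1] + 1;
                    if (curr[j] > bestLength)
                    {
                        bestLength = curr[j];
                        bestEndInA = i;
                    }
                }
                else
                {
                    curr[j] = 0;
                }
            }

            // Reuse the two rows instead of allocating a full table.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }

        return a.Substring(bestEndInA - bestLength, bestLength);
    }
}
```

For the two file names above, the shared run starts at the trailing "h" of the junk prefix, so you would probably still want to trim separators before comparing.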

You'd need some heuristics there, of course. One thing you might try is putting the string through a Soundex converter. Soundex is the "codec" used to see if things "sound" the same (as you might tell a telephone operator). It's more or less a rough phonetic transliteration that tolerates some misspelling and mispronunciation. It is definitely poorer than edit distance, but much, much cheaper. (The official use is for surnames, and the code keeps only the first letter plus three digits. There's no reason to stop there, though: just use the mapping for every character in the string. See Wikipedia for details.)
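A minimal sketch of that "don't stop at three" variant might look like the following. It uses the standard Soundex letter-to-digit table but does not truncate or pad the result. The class name is made up, and treating any non-letter (dots, underscores, digits) as a separator that resets the run is my own assumption, not part of Soundex.

```csharp
using System;
using System.Text;

public static class PhoneticSketch
{
    // Standard Soundex digit for a letter, or '0' for letters that carry no code
    // (vowels, Y, H, W).
    private static char CodeFor(char upper)
    {
        switch (upper)
        {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0';
        }
    }

    // Encodes the whole string: first letter kept as-is, every later letter mapped
    // to its digit, adjacent repeats of the same digit collapsed, no length limit.
    public static string Encode(string text)
    {
        var sb = new StringBuilder();
        char last = '0';
        foreach (char raw in text ?? string.Empty)
        {
            char c = char.ToUpperInvariant(raw);
            if (c < 'A' || c > 'Z')
            {
                last = '0'; // assumption: separators and digits end the current run
                continue;
            }

            char code = CodeFor(c);
            if (sb.Length == 0)
            {
                sb.Append(c); // keep the first letter itself
                last = code;
                continue;
            }

            if (c == 'H' || c == 'W') continue; // H and W do not break a run

            if (code != '0' && code != last) sb.Append(code);
            last = code; // a vowel resets the run, so a repeated digit is allowed after it
        }
        return sb.ToString();
    }
}
```

With this, "Robert" and "Rupert" both come out as "R163", the same as classic Soundex, while longer titles simply produce longer codes.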

So my suggestion would be to Soundex your strings, chop each code into a few length tranches (say the first 5, 10 and 20 characters) and then just look at clusters. Within clusters you can use something more expensive like edit distance or longest common substring.
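To make the two stages concrete, here is a rough sketch: a cheap bucketing step keyed on a prefix of whatever code you compute, and a plain Levenshtein distance for the expensive comparison inside each bucket. The names `ClusterByPrefix` and `encode` are hypothetical, and the encoder is passed in as a delegate so any phonetic or normalising function can be plugged in.

```csharp
using System;
using System.Collections.Generic;

public static class ClusterSketch
{
    // Groups names by the first prefixLength characters of encode(name).
    public static Dictionary<string, List<string>> ClusterByPrefix(
        IEnumerable<string> names, Func<string, string> encode, int prefixLength)
    {
        var clusters = new Dictionary<string, List<string>>();
        foreach (string name in names)
        {
            string code = encode(name);
            string key = code.Length <= prefixLength ? code : code.Substring(0, prefixLength);

            List<string> bucket;
            if (!clusters.TryGetValue(key, out bucket))
            {
                bucket = new List<string>();
                clusters[key] = bucket;
            }
            bucket.Add(name);
        }
        return clusters;
    }

    // Classic Levenshtein edit distance (insert, delete, substitute all cost 1),
    // computed with two rolling rows.
    public static int Levenshtein(string a, string b)
    {
        a = a ?? string.Empty;
        b = b ?? string.Empty;

        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.Length];
    }
}
```

Running the bucketing once per tranche length (5, 10, 20) gives coarse-to-fine clusters, and the pairwise distance is only computed between names that already share a bucket.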

Greg 2008-09-10 06:37:49

Levenshtein distance (already being used) is a better algorithm here than a phonetic one like Soundex, which also only looks at the start of a word.

Keith 2008-09-10 06:49:17

Answer 2

A:

Your problem here may be distinguishing between noise words and useful data:

Rolling_Stones.Best_of_2003.Wild_Horses.mp3
Super.Quality.Wild_Horses.mp3
Tori_Amos.Wild_Horses.mp3

You may need to produce a dictionary of noise words to ignore. That seems clunky, but I'm not sure there's an algorithm that can distinguish between band/album names and noise.
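A minimal sketch of the dictionary approach, assuming a hand-maintained noise list: the words in `NoiseWords` are invented for this example, as is the choice to split on dots, underscores, hyphens and spaces and to drop purely numeric tokens such as years.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class NoiseSketch
{
    // Hypothetical noise list; in practice this would be built up from real data.
    private static readonly HashSet<string> NoiseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "super", "quality", "best", "of", "mp3"
        };

    private static readonly char[] Separators = { '.', '_', '-', ' ' };

    // Splits a file name into lower-case keywords, dropping the extension,
    // known noise words and all-digit tokens.
    public static string[] Keywords(string fileName)
    {
        string stem = Path.GetFileNameWithoutExtension(fileName);
        return stem
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !NoiseWords.Contains(word))
            .Where(word => !word.All(c => char.IsDigit(c)))
            .Select(word => word.ToLowerInvariant())
            .ToArray();
    }
}
```

On the three names above this leaves "wild horses" for the second file, but the band names survive in the other two, which is exactly the part a noise list alone cannot decide.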

Keith 2008-09-10 06:59:10

I have a list of bands, and I'm ignoring them in the keywords.

Lukas Šalkauskas 2008-09-10 07:05:30

Answer 3

A:

There's a lot of work done on the somewhat related problem of DNA sequence alignment (search for "local sequence alignment"). The classic local alignment algorithm is "Smith-Waterman" (its global counterpart is "Needleman-Wunsch"), and more complex modern ones are also easy to find. The idea, similar to Greg's answer, is that instead of identifying and comparing keywords you try to find the longest loosely matching substrings within long strings.
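For a feel of what local alignment does, here is a small sketch of the Smith-Waterman scoring recurrence applied to plain strings. It only returns the best score, not the aligned region, and the default scores (match +2, mismatch -1, gap -1) are arbitrary values picked for illustration rather than anything tuned for file names.

```csharp
using System;

public static class AlignmentSketch
{
    // Best local alignment score between a and b (Smith-Waterman with a linear gap penalty).
    // A higher score means a longer or cleaner loosely matching region.
    public static int LocalAlignmentScore(
        string a, string b, int match = 2, int mismatch = -1, int gap = -1)
    {
        if (string.IsNullOrEmpty(a) || string.IsNullOrEmpty(b)) return 0;

        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        int best = 0;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = 0;
            for (int j = 1; j <= b.Length; j++)
            {
                int diagonal = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
                int up = prev[j] + gap;       // skip a character of a
                int left = curr[j - 1] + gap; // skip a character of b

                // Clamping at zero is what makes the alignment local:
                // a bad stretch restarts the match instead of dragging it down.
                curr[j] = Math.Max(0, Math.Max(diagonal, Math.Max(up, left)));
                if (curr[j] > best) best = curr[j];
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return best;
    }
}
```

Unlike a strict longest common substring, a stray separator only costs one gap here: "wild_horses" against "wildhorses" still scores as one long match.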

That being said, if the only goal is sorting music, a number of regular expressions to cover possible naming schemes would probably work better than any generic algorithm.
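A rough sketch of the regular-expression route, tried in order from most to least specific. The three naming schemes here are guesses at common layouts ("Artist.Album.Title.mp3", "Artist.Title.mp3", "Artist - Title.mp3"), and the class and method names are made up, so treat the patterns as placeholders for whatever your collection actually contains.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamingSketch
{
    // Hypothetical naming schemes, most specific first.
    private static readonly Regex[] Schemes =
    {
        new Regex(@"^(?<artist>[^.]+)\.(?<album>[^.]+)\.(?<title>[^.]+)\.mp3$", RegexOptions.IgnoreCase),
        new Regex(@"^(?<artist>[^.]+)\.(?<title>[^.]+)\.mp3$", RegexOptions.IgnoreCase),
        new Regex(@"^(?<artist>.+?) - (?<title>.+)\.mp3$", RegexOptions.IgnoreCase)
    };

    // Tries each scheme in turn; on success returns the artist and title
    // with underscores turned back into spaces.
    public static bool TryParse(string fileName, out string artist, out string title)
    {
        foreach (Regex scheme in Schemes)
        {
            Match m = scheme.Match(fileName);
            if (m.Success)
            {
                artist = m.Groups["artist"].Value.Replace('_', ' ');
                title = m.Groups["title"].Value.Replace('_', ' ');
                return true;
            }
        }

        artist = null;
        title = null;
        return false;
    }
}
```

Names that fit no scheme fall through with `false`, which is a natural place to hand over to one of the fuzzier methods.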

ima 2008-09-11 06:44:57

Answer 4

A:

Look at www.match-logics.com. It does exactly what you want.

paul 2010-05-11 15:46:15
