views:

283

answers:

4

I know this is not a straight up question, so if you need me to provide more information about the scope of it, let me know. There are a bunch of questions that address almost the same issue (they are linked here), but never the exact same one with the same kind of scope and objective - at least as far as I know.

Context:

  • I have a MP3 file with ID3 tags for artist name and song title.
  • I have two tables Artists and Songs
  • The ID3 tags might be slightly off (e.g. Mikaell Jacksonne)
  • I'm using ASP.NET + C# and a MSSQL database

I need to synchronize the MP3s with the database. Meaning:

  1. The user launches a script
  2. The script browses through all the MP3s
  3. The script says "Is 'Mikaell Jacksonne' 'Michael Jackson' YES/NO"
  4. The user pick and we start over

Examples of what the system could find:

In the database...

SONGS = {"This is a great song title", "This is a song title"}
ARTISTS = {"Michael Jackson"}

Outputs...

"This is a grt song title" did you mean "This is a great song title" ?
"This is song title" did you mean "This is a song title" ?
"This si a song title"  did you mean "This is a song title" ?
"This si song a title"  did you mean "This is a song title" ?
"Jackson, Michael" did you mean "Michael Jackson" ?
"JacksonMichael" did you mean "Michael Jackson" ?
"Michael Jacksno" did you mean "Michael Jackson" ?

etc.

I read some documentation from this /how-do-you-implement-a-did-you-mean and this is not exactly what I need since I don't want to check an entire dictionary. I also can't really use a web service since it's depending a lot on what I already have in my database. If possible I'd also like to avoid dealing with distances and other complicated things.


I could use the google api (or something similar) to do this, meaning that the script will try spell checking and test it with the database, but I feel there could be a better solution since my database might end up being really specific with weird songs and artists, making spell checking useless.

I could also try something like what has been explained on this post, using Soundex for c#.

Using a regular spell checker won't work because I won't be using words but names and 'titles'.


So my question is: is there a relatively simple way of doing this, and if so, what is it?

Any kind of help would be appreciated.

Thanks!

+3  A: 

What you want is a similarity factor. Essentially, you want to compare your inputs ("Micheal Jackson", for example) to your expected values ("Michael Jackson"); if you score a very high similarity value to one of your expected values, you can ask the user.

One way of doing this is to hash the expected values into a fully packed hashtable. If you get your hashing algorithm right (and yes, this is the tricky bit), each input will hash to the closest expected value; once you've found the closest expected value, you can run a similarity evaluation against the input and that expected value; if you're above a certain threshold, ask the user.

McWafflestix
I didn't think of the hash, but that's true and very clever! Do you have pointers on where to look for such an hash algorithm?
marcgg
@marcgg: You might try google for such a hash algorithm; however, you may need to do a lot of customization of it for your expected data set. I don't know of any good references for such a thing off the top of my head...
McWafflestix
+1  A: 

A fairly simple but relatively inaccurate system would be to compare the characters of the strings, and measure the number of characters that are different/missing/added in the user's string. If the number of characters is few enough (you might try weighting the differences based on key distance [lookup table] or somesuch), then ask the user if they meant a specific given string

Sukasa
this could work as a last resort. Meaning I try to lookup exact match, then something else maybe more accurate and then this. I'm sure there will be something like 30% of match, maybe less, but that's still an interesting idea, thanks!
marcgg
The biggest hangup here is going to be how you compare the characters, since "ABCDE" should only be one character away from "ACDE", not 4 characters away as one old attempt of mine gave (can you figure out why?)
Sukasa
It was doing that because you were testing the exact position of letters. I guess doing a mix like "same letter-same position: +10, same letter-different position: +1" could give something a bit more robust. Or not.
marcgg
Yep, I wasn't checking to see if a letter had been added or removed when I ran through the string. Of course, you can't really tell how many letter the user might be off by.
Sukasa
+1  A: 

This is a non-trivial task. Check out Wikipedia for more info about algorithms that deal with this. You hit on soundex already, but there are other transformations that you're looking for here.

Adam Hawkes
+1  A: 

This sounds very similar to creating a spell checker, which is best done with a ternary search tree. The link uses Java for it's example, but the data structure is the important part. The data structure behaves like a Hash with the properties that were mentioned by McWafflestix.

Graphics Noob

related questions