views:

634

answers:

2

Hi everyone.

I have a small problem with the core data application i'm currently writing. I have two differents models, contexts and peristent stores. One is for my app data, the other one is for a website with relevant infos to me.

Most of the time, I match exactly one record from my app to another record from the other source. Sometimes however, I have to fallback to fuzzy string matching to link the two records. I'm trying to match song titles. My local title could be the (made up) "The French Idealist is in your pensée" and the remote song title could be "01 - 10 - French idealist in in you're pensee, The (dub remix, feat. DJ Objective-C)"

I search stack overflow, Google, the cocoa documentation, and I can't find any clear answer on how to do a fuzzy matching in these cases. My strings can start with anything, have a bunch of special characters, usually end with random or to be ignored characters.

Regexp won't do, nor NSPredicates, Soundex doesn't work well with foreign names, and maybe the Levenshtein won't be enough (or will it ?).

I'm looking for a title in a set of about a dozen potential matches, but I hava to do this operation quite a lot. 100% accuracy is not the goal.

I was thinking of removing the ignored words, extracting the keywords (in this example, "french, idealist, pensée"), concatenate them, and then use the Levenshtein distance (words in song title should be in the same order).

In my special case, would it work ? What is the industry standard regarding this problem (I can't be the only one in the world who want to match slightly different songs names) Can Core Data, Cocoa or Objective-C help me ?

Thanks a lot.

+1  A: 

You want your search to be diacritic insensitive to match the 'é' in pensée and 'e' in pensee. You get this by adding the [d] after the attribute. Like so:

    NSPredicate *predicate = [NSPredicate predicateWithFormat:@"(songTitle like[cd] %@)", yourSongSubstring];
The 'c' in [cd] is for case insensitivity.

Since your string could appear in any order in the string you are searching, you could tokenize your search string ([... componentsByString:@" "]) then create a predicate like

    NSPredicate *predicate = [NSPredicate predicateWithFormat:@"(songTitle like[cd] %@) and (songTitle like[cd] %@)", songToken1, songToken2];
That syntax to combine predicates above may be off, going from memory.

baalexander
Well, I first tried a variation of this and when I parse real world data, it doesn't quite work. Most of the time, the problem is not is the diacritics or case but in subtlely spelled differences (as in "Backstreet girl" vs "Back Street Girl"). This solution is also heavily depend on the previous step, tokenization, which is really hard for the domain "words that could appear in a song title"
damdamdam
A: 

I believe the tool you want to use here is SearchKit. I say that as if I've just made your job easy.... I haven't, but it should have the tools you need to be successful here. LNC is still offering their SearchKit Podcast for free (very nice).

Each track would be a document in this case, and you'd need to come up with a good way to index them with an identifier that can be used to find them. You can then load them up with metadata, and search them. Perhaps putting the title "in" the document would be helpful here to facilitate the use of Similarity Searching (kSKSearchOptionFindSimilar). That may or may not work really well.

The question you've asked is a good one, but there is certainly no industry standard for it because anyone who solves this problem well (i.e. every major search engine) keeps their algorithms very secret. This is a hard problem; no one is quite ready to give away their answer.

Rob Napier
SearchKit. I completely forgotten about this API. I looked very hard at the doc, I saw immediate uses in my app for it, but I think it's way too involved just to appoximate a match between a string and an other string.
damdamdam