tags:

views:

223

answers:

6

After finishing two cs classes, I've started working on a personal project in Java. I am writing a program that will look through a music collection and attempt to set the 'Composer' tag by looking at the filename and meta tags. I am comparing these against a composer list I have created as a simple text file. My question is this:

What is a good method for comparing two strings to try to find a best match of sorts? For exammple, in my case suppose I have a file called 'Pulenc - Gloria in excelsis Deo.flac'. In my composer list I have 'Poulenc, Francis'. I want to be able to read 'Pulenc', and see that it is very close to 'Poulenc' so that I can have the composer tag set correctly. A friend suggested I look into using Cosine Distance (which I'd never heard of before), and another recommended Levenshtein Distance. Are either of these a good approach or are there other methods that may work better?

+2  A: 

There's a project called SimMetrics, run by the University of Sheffield in the UK that will be able to help you out. I wrote a little bit about it in my blog from a .NET point of view, but I believe the project also has a Java implementation available.

ZombieSheep
+3  A: 

It sounds like the Levenshtein Distance is exactly what you need. The Cosine Distance seems to deal with longer texts, and phonetic algorithms like Soundex will probably yield poor results for names, most of which are not intended to be pronounced using English pronounciation rules.

Michael Borgwardt
A: 

Levenshtein distance is a reasonable idea, although if you have a lot of composers in your system, it may perform badly. Unlike Soundex (or Metaphone, or NYSIIS), edit distance algorithms make you compare your misspelled composer name to every other composer in the system. Depending on how many there are, this could take a while.

As a (premature?) optimisation, it may be worth only computing Levenshtein distance for composers whose name starts with the right letter.

Simon Nickerson
A: 

Peter Norvig has written an excellent piece, "How to Write a Spelling Corrector", that you may find useful, and that you can tweak to your specific needs.

David
A: 

By the way, the Apache Commons Lang lib implements a Levenshtein Distance calculator: http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringUtils.html#getLevenshteinDistance(java.lang.String,%20java.lang.String)

mjd79
A: 

I think in your case Damerau–Levenshtein distance should work fine. If you have more data, use it. In the absence of good algorithm, large amount of data can compensate.

Hao Wooi Lim