Suppose I have a large and legal collection of music that has been inconsistently tagged. I want to normalise a single field, e.g. Artist, so that, for example, the following variants all become a single artist:

  • the Grateful Dead
  • the gratefull dead
  • grateful dead
  • grateful dead, The

So... should I be trying to match against a large database like freedb, or is there some lower-level string-manipulation approach?

I have tried hand-rolling code to do string-similarity checks and then matching the results against high-frequency data in freedb, with limited success.

Any more ideas?

+3  A: 

MusicBrainz is recommended, and no coding is required!

SCdF
+2  A: 

For this problem I used MusicBrainz Picard, which takes an audio fingerprint and then fills in the right tags automatically from the MusicBrainz database. I used it for filling in track numbers myself, but I'm pretty sure it will solve your problem as well.

Michiel Borkent
A: 
  1. You need a basic API to read and write tags.
  2. Define a key token (a pattern) for each artist, so that every character sequence that matches it is added to the collection for that token.
  3. For each token you need the artist name that will be used in place of the original variants.
  4. You can use a regex or a parser for the matching, whichever you prefer.
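
A minimal sketch of steps 2 to 4 in Python, assuming the regex route. The rule table `CANONICAL_ARTISTS` and the helper `canonical_artist` are hypothetical names made up for illustration, and reading and writing the tags themselves (step 1) is left out because it depends on whichever tag library you pick.

```python
import re

# Hypothetical rule table: each pattern is the "key token" for one artist,
# and the string next to it is the name to write back into the tag.
CANONICAL_ARTISTS = [
    (re.compile(r"grat+eful+\s+dead", re.IGNORECASE), "Grateful Dead"),
]


def canonical_artist(raw_tag):
    """Return the canonical name for raw_tag, or raw_tag unchanged if no rule matches."""
    for pattern, name in CANONICAL_ARTISTS:
        if pattern.search(raw_tag):
            return name
    return raw_tag
```

With this table, `canonical_artist("grateful dead, The")` and `canonical_artist("the gratefull dead")` both come back as the same name, while a tag that matches no rule is returned untouched. The obvious cost is that every artist needs a hand-written pattern.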

Cheers.

dimarzionist
A: 

I'd first go with a sorting approach plus manual editing.

Sort them by name, ignoring case and the articles a/an/the.

After it's been sorted, the result should be a better candidate for string matching.

After that, it should be possible to write an algorithm that looks at the 2-3 entries above and below each song and computes the similarity of the names.
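
A rough Python sketch of the sort-then-compare idea. The names `sort_key` and `similar_neighbours`, the window of 3 and the 0.85 threshold are all assumptions for illustration. The similarity score comes from the standard library's `difflib.SequenceMatcher`, which is just one possible choice.

```python
import re
from difflib import SequenceMatcher


def sort_key(name):
    """Lower-case the name and drop a leading article or a trailing ", the"-style article."""
    key = name.lower().strip()
    key = re.sub(r",\s*(a|an|the)$", "", key)   # "grateful dead, the" -> "grateful dead"
    key = re.sub(r"^(a|an|the)\s+", "", key)    # "the grateful dead" -> "grateful dead"
    return key


def similar_neighbours(names, window=3, threshold=0.85):
    """Yield (name, other, score) for names that end up close together after sorting."""
    ordered = sorted(set(names), key=sort_key)
    for i, name in enumerate(ordered):
        # Only look at the next `window` entries; earlier ones were already paired with this one.
        for other in ordered[i + 1 : i + 1 + window]:
            score = SequenceMatcher(None, sort_key(name), sort_key(other)).ratio()
            if score >= threshold:
                yield name, other, score
```

The pairs it yields are only candidates for review, which fits the manual step that follows.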

Then it's manual work from there. I wouldn't trust algorithms to tag my music library automatically; they are not good enough, not even freedb.

chakrit
A: 

I would suggest researching algorithms for data cleansing.

Sergio Acosta
+1  A: 

Hi,

A simple approach is as follows,

  1. Get a list of the top 1000 most common English words and remove these from all tags.
  2. Make everything lower case
  3. Sort

This approach is okay, but of course it leaves a lot of manual work: you will need to do the clustering yourself.

So, you might want to use a similarity measure between names and cluster with that. Assuming you had such a measure (I'll describe a simple one in a bit), you can proceed as follows:

  1. Take the mean and variance of the similarity score between songs. You can do this across all songs or sample if you have too many.

  2. Create an empty set of clusters, C = {}.

  3. Iterate over all songs. For each song, iterate over all clusters; if the average score between the song and the songs in a cluster is more than 2 to 3 standard deviations above the mean pair score, add the song to that cluster. If there is no such cluster, create a new cluster containing that song and add it to C.

So the "2 or 3" will need to be tuned by hand, but once you've got that magic number the process will be more or less automatic.
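
The three steps could be sketched in Python roughly as follows. The function names and the default `k=2.0` are assumptions, and the default similarity here is a stand-in taken from `difflib.SequenceMatcher`, not the measure described in this answer; any function returning a higher number for more similar names can be passed in instead.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean, pstdev


def default_similarity(a, b):
    # Stand-in measure (an assumption): difflib's ratio, between 0 and 1.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(names, similarity=default_similarity, k=2.0):
    """Greedy clustering: join the first cluster whose average score beats mean + k * std."""
    names = list(names)
    if len(names) < 2:
        return [[name] for name in names]
    # Step 1: mean and standard deviation over all pairs (sample the pairs if there are too many).
    scores = [similarity(a, b) for a, b in combinations(names, 2)]
    threshold = mean(scores) + k * pstdev(scores)
    # Step 2: start with no clusters.
    clusters = []
    # Step 3: assign each name to the first cluster that is similar enough.
    for name in names:
        for members in clusters:
            if mean(similarity(name, m) for m in members) > threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```

Note that on a small or very uniform collection the threshold can end up above the highest possible score, in which case every name lands in its own cluster; that is part of why `k` has to be tuned on your own data.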

Once you have these clusters you'll need to create a representative name for that song cluster. This can be accomplished by just taking one at random, or trying to find a similar song in a known list of song names. Then assign the designated name to all songs with names in that cluster.

A simple similarity measure that could work well is just to count the number of length 1, 2, 3, ..., n substrings that are common to both strings. You would weight the counts by how long the substring is, e.g. sharing substrings of length 3 is more significant than sharing substrings of length 1. Then, so as not to bias against songs with really long names, you would normalize the score by the length of the song titles being compared.
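
One way this measure might look in Python. This is a sketch under a few assumptions: only distinct substrings are counted, the weight of a substring is simply its length, `max_n` defaults to 3, and the normalization divides by the largest weighted count either string could contribute, which is one of several ways to account for length.

```python
def substrings(text, n):
    """Return the set of distinct length-n substrings of text."""
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, max_n=3):
    """Weighted count of shared substrings of length 1..max_n, scaled to the range 0..1."""
    a, b = a.lower(), b.lower()
    shared = 0
    possible = 0
    for n in range(1, max_n + 1):
        grams_a, grams_b = substrings(a, n), substrings(b, n)
        shared += n * len(grams_a & grams_b)                # longer substrings weigh more
        possible += n * max(len(grams_a), len(grams_b))     # normalization term
    return shared / possible if possible else 0.0
```

Identical names score 1.0 and names with no characters in common score 0.0, so the result can be compared across titles of different lengths.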

Regards,

Owen.

Owen
+1  A: 

Python has a very, very useful library called difflib. One function in particular, difflib.get_close_matches(word, possibilities), has been of great use, even for finding duplicate (or near-duplicate) filenames. Apart from that, MusicBrainz data can be accessed through the musicbrainz2 package, and its findartist.py script gives example code using ArtistFilter for fuzzy-matched results, which you could use.
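
A small sketch of how `difflib.get_close_matches` might be applied to artist tags. The reference list `known_artists` and the helper `closest_artist` are hypothetical; in practice the list would come from whatever trusted source you have. The MusicBrainz lookup is not shown.

```python
import difflib

# Hypothetical reference list of correctly spelled artist names.
known_artists = ["Grateful Dead", "Pink Floyd", "Led Zeppelin"]


def closest_artist(raw_tag, known=known_artists, cutoff=0.6):
    """Return the known artist closest to raw_tag, or None if nothing scores above cutoff."""
    # Compare in lower case, but remember the properly cased spelling.
    lowered = {name.lower(): name for name in known}
    matches = difflib.get_close_matches(raw_tag.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None
```

The `cutoff` of 0.6 is the library default; raising it makes the match stricter, at the price of missing more badly mangled tags.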

ΤΖΩΤΖΙΟΥ