Hi,
A simple approach is as follows:
- Get a list of the top 1000 most common English words and remove these from all tags.
- Make everything lower case
- Sort
This approach is okay, but it still leaves a lot of manual work: sorting only puts similar names next to each other, and you will need to do the clustering yourself.
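As a rough sketch of those three steps in Python: the `COMMON_WORDS` set here is a tiny hypothetical stand-in for a real top-1000 word list, and the function names are my own. Note that it lower-cases before removing words, so that "The" and "the" are treated the same.

```python
# Hypothetical stand-in for a real list of the 1000 most common English words.
COMMON_WORDS = {"the", "a", "of", "and", "in", "to"}

def normalize(tag, common_words=COMMON_WORDS):
    """Lower-case a tag and drop any common words from it."""
    words = tag.lower().split()
    kept = [w for w in words if w not in common_words]
    return " ".join(kept)

def normalized_sorted(tags):
    """Normalize every tag, then sort so near-duplicates end up adjacent."""
    return sorted(normalize(t) for t in tags)

print(normalized_sorted(["The Wall", "A Day in the Life", "the wall"]))
```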
So you might want to define a similarity measure between names and cluster with that. Assuming you have such a measure (I'll describe a simple one in a bit), you can proceed as follows:
- Compute the mean and variance (and hence the standard deviation) of the similarity score between pairs of songs. You can do this across all pairs, or over a sample if you have too many.
- Create an empty set of clusters, C = {}.
- Iterate over all songs. For each song, iterate over all clusters: if the average score between the song and the songs already in a cluster is more than 2 to 3 standard deviations above the mean pair score, add the song to that cluster. If there is no such cluster, create a new cluster containing just that song and add it to C.
You will need to tune the "2 or 3" by hand, but once you've got that magic number the process will be more or less automatic.
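The clustering loop above could be sketched like this. It is only one reading of the steps: I've assumed a song joins the first cluster that passes the threshold, the parameter `k` is the "2 or 3" multiplier, and `stand_in_similarity` (built on `difflib.SequenceMatcher`) is just a placeholder so the example runs; pass your own measure as `sim`.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean, pstdev

def stand_in_similarity(a, b):
    # Placeholder measure only; swap in whatever similarity you prefer.
    return SequenceMatcher(None, a, b).ratio()

def cluster_names(names, sim=stand_in_similarity, k=2.0):
    """Greedy clustering: a name joins the first cluster whose average
    similarity to it exceeds mean + k * std of all pairwise scores."""
    pair_scores = [sim(a, b) for a, b in combinations(names, 2)]
    if not pair_scores:
        return [[n] for n in names]
    threshold = mean(pair_scores) + k * pstdev(pair_scores)

    clusters = []  # this is C, kept as a list of lists
    for name in names:
        for cluster in clusters:
            avg = mean(sim(name, other) for other in cluster)
            if avg > threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([name])
    return clusters
```

On a very small set of names the threshold can end up above the largest possible score, in which case every name lands in its own cluster, so expect to experiment with `k` on your real data.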
Once you have these clusters you'll need to choose a representative name for each one. You can do this by just picking one of the names at random, or by trying to find a similar song in a known list of song names. Then assign the chosen name to all songs with names in that cluster.
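Here is one way the naming step might look. The `known_titles` list, the 0.8 cutoff and the function names are all assumptions of mine; the lookup uses `difflib.get_close_matches` from the standard library and falls back to a random pick when nothing in the known list is close enough.

```python
import random
from difflib import get_close_matches

def representative(cluster, known_titles=None, rng=random):
    """Pick one name to stand for the whole cluster."""
    if known_titles:
        for name in cluster:
            # Assumed cutoff of 0.8; tune it for your data.
            match = get_close_matches(name, known_titles, n=1, cutoff=0.8)
            if match:
                return match[0]
    # No known list, or nothing close enough: take one at random.
    return rng.choice(cluster)

def relabel(clusters, known_titles=None):
    """Map every original name to its cluster's representative name."""
    mapping = {}
    for cluster in clusters:
        rep = representative(cluster, known_titles)
        for name in cluster:
            mapping[name] = rep
    return mapping
```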
A simple similarity measure that could work well is just to count the number of length 1, 2, 3, ..., n substrings that are common to both strings. You would weight the counts by how long the substring is, e.g. sharing substrings of length 3 is more significant than sharing substrings of length 1. Then, to not bias against songs with really long names, you would normalize the score by the lengths of the song titles being compared.
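A minimal sketch of that measure, with a few choices that are mine rather than fixed by the description: each distinct shared substring is counted once, the weight is simply the substring length, n defaults to 3 (`max_len`), and the score is divided by the sum of the two title lengths.

```python
def substrings(s, k):
    """Set of distinct substrings of length k in s (empty if k > len(s))."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def similarity(a, b, max_len=3):
    """Length-weighted count of shared substrings, normalized by title length."""
    if not a or not b:
        return 0.0
    score = 0
    for k in range(1, max_len + 1):
        shared = substrings(a, k) & substrings(b, k)
        score += k * len(shared)  # longer shared substrings count for more
    return score / (len(a) + len(b))

print(similarity("yesterday", "yesterdy"))
print(similarity("yesterday", "help"))
```

A misspelling of the same title should score well above an unrelated title, which is all the clustering step needs.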
Regards,
Owen.