ansaurus

Question

Answer 1

+3 A:

As far as comparing names is concerned you might want to take a look at the Levenshtein distance algorithm. Given two strings it will calculate a distance measurement which can be used as a basis for catching duplicates.

I personally have used it in a tool I developed for an application with a rather large database that had a large number of duplicates in it. Using it in conjunction with some other data comparisons relevant to my domain I was able to point my tool at the application database and quickly find many of the duplicated records. Not going to lie, I thought it was pretty darn cool to see in action.

It's even quick to implement, here's a C# version:

public int CalculateDistance(string s, string t) {
    int n = s.Length; //length of s
    int m = t.Length; //length of t
    int[,] d = new int[n + 1, m + 1]; // matrix
    int cost; // cost

    // Step 1
    if (n == 0) return m;
    if (m == 0) return n;

    // Step 2
    for (int i = 0; i <= n; d[i, 0] = i++) ;
    for (int j = 0; j <= m; d[0, j] = j++) ;
    // Step 3
    for (int i = 1; i <= n; i++) {
        //Step 4
        for (int j = 1; j <= m; j++) {
            // Step 5
            cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);

            // Step 6
            d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
        }
    }

    // Step 7
    return d[n, m];
}

Nathan Taylor 2010-05-12 14:27:00

I know of the algorithm and was intending to use it for weeding out the "easy" targets, however some artists have the same name and there is an overlap where simple text comparison won't do, which is why I'm looking for something domain specific for music data.

Christian P. 2010-05-14 11:42:42

Answer 2

+1 A:

I wrote a similar answer here: Music Recognition and Signal Processing.

In the research community, the problem of finding similarity between two signals (up to environmental distortions such as noise, mild variations in tempo, pitch, or bitrate) is known as audio (or music) fingerprinting. This topic has been studied heavily for at least a decade. This early (and oft cited) paper by Haitsma and Kalker clearly describes the problem and proposes a simple solution.

The problem of finding musical similarity between two versions of the same song is known as cover song identification. This problem is also studied heavily but is still considered open.

Perhaps the two most popular commercial solutions for content-based musical search are Midomi and Shazam.

I believe this addresses your question. Check Google Scholar for recent solutions to these problems. The ISMIR proceedings are available for free online.

Steve 2010-05-18 17:35:14

Thanks, this was exactly the type of answer I was looking for!

Christian P. 2010-05-20 23:58:39

ansaurus

tags:

views:

answers:

Comparison of music data

related questions