Hi all,
I'm looking for some guidance on which techniques/algorithms I should research to solve the following problem. I currently have an algorithm that clusters similar-sounding MP3s using acoustic fingerprinting. In each cluster, I have all the different metadata (song/artist/album) for each file. For each cluster, I'd like to pick the "best" song/artist/album metadata that matches an existing row in my database or, if there is no good match, decide to insert a new row.
For a cluster, there is generally some correct metadata, but individual files have many types of problems:
- The artist/song is completely misnamed, or just slightly misspelled
- The artist/song/album is missing, but the rest of the information is there
- The song is actually a live recording, but only some of the files in the cluster are labeled as such
- There may be very little metadata, in some cases just the file name, which might be artist - song.mp3, or artist - album - song.mp3, or another variation
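To make the file-name case concrete, here is a rough Python sketch of how the two patterns mentioned above could be split into fields. The function name `parse_filename` and the " - " separator rule are assumptions for illustration only; real file names have many more variations than this handles.

```python
import os
import re

def parse_filename(path):
    """Guess metadata from a file name such as 'artist - song.mp3'.

    Hypothetical helper: it only knows the two patterns from the list
    above and assumes fields are separated by ' - '.
    """
    # Drop the directory and the extension.
    stem = os.path.splitext(os.path.basename(path))[0]
    # Split on a dash surrounded by whitespace, ignoring empty pieces.
    parts = [p.strip() for p in re.split(r"\s+-\s+", stem) if p.strip()]
    if len(parts) == 2:
        artist, song = parts
        return {"artist": artist, "album": None, "song": song}
    if len(parts) == 3:
        artist, album, song = parts
        return {"artist": artist, "album": album, "song": song}
    # Unknown variation: keep the whole stem as the song title.
    return {"artist": None, "album": None, "song": stem}
```

Anything that does not fit either pattern falls through with only a song title, so it still contributes something without inventing an artist or album.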
A simple voting algorithm works fairly well, but I'd like to have something I can train on a large set of data that might pick up more nuances than what I've got right now. Any links to papers or similar projects would be greatly appreciated.
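For reference, a simple per-field voting baseline of this kind could look roughly like the sketch below. This is a simplified illustration, not my exact code: the names `normalize` and `vote`, the dict-per-file record layout, and the ASCII-only normalization are all assumptions.

```python
from collections import Counter
import re

def normalize(value):
    """Lowercase and strip punctuation so near-identical strings vote together.

    Assumption: ASCII-only cleanup; non-Latin titles would need a
    different rule.
    """
    if not value:
        return None
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", value.lower())
    cleaned = " ".join(cleaned.split())
    return cleaned or None

def vote(cluster):
    """Pick the most common value per field across one cluster.

    `cluster` is a list of dicts with optional 'artist', 'album', 'song'
    keys. Returns {field: (winning_value, share_of_votes)}.
    """
    result = {}
    for field in ("artist", "album", "song"):
        counts = Counter()
        originals = {}  # first original spelling seen for each normalized key
        for record in cluster:
            key = normalize(record.get(field))
            if key is None:
                continue  # missing metadata does not vote
            counts[key] += 1
            originals.setdefault(key, record[field])
        if not counts:
            result[field] = (None, 0.0)
            continue
        key, n = counts.most_common(1)[0]
        result[field] = (originals[key], n / sum(counts.values()))
    return result
```

The vote share could serve as a crude confidence score for deciding between matching an existing row and inserting a new one, but it treats every file as equally trustworthy, which is the main limitation I'd like a trained approach to address.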
Thanks!