I've got two directories containing ~20 GB of music files (mostly mp3, some ogg), and I would like to detect all duplicate songs. There are two complicating factors:

  1. A song may have different filenames in the two directories.
  2. Two files containing the same song may have different ID3 tags and thus have different checksums.

What is a good approach to solving this?

+1  A: 

Are the ID3/OGG-equiv artist and song metatags accurate? If they are, you could use those.

Edit: If they're not, perhaps they could be made to be... If you're only dealing with whole albums, there are several tools that will get all the tag data based on the number of tracks and their lengths.

If you're dealing with mixes of albums and single files, it gets more complicated.

Oli
Perhaps, but I don't think I can rely on them.
JesperE
+1  A: 

If you have a library that can parse the files, you can run the hash on the audio data alone. This will not help you if the song is a different rip or has been recompressed, transcoded, etc.

Aaron Maenpaa
It would be a bit beastly on the processing front... Just getting lots of ID3s takes an age but hashing audio data on top? Eeek!
Oli
A: 

Perhaps the Last.fm API would be useful. It includes a track.getInfo call which returns XML including the track's length, artist name, track number, etc. You could compare tracks and see if they have more than N fields equal and if so, assume they're the same track.

I have no idea whether they're going to be OK with you submitting API requests for 40 GB of music, though.
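The "more than N fields equal" rule could be sketched roughly as follows. This is only an illustration: the records are plain dictionaries with hypothetical field names (`artist`, `title`, `album`, `track_number`, `duration`), not the actual format of a Last.fm response, and the default threshold is an arbitrary assumption.

```python
# Hypothetical field names; adapt them to whatever metadata source you use.
DEFAULT_FIELDS = ("artist", "title", "album", "track_number", "duration")


def _normalize(value):
    """Compare values case-insensitively and ignore surrounding whitespace."""
    return str(value).strip().lower()


def matching_fields(a, b, fields=DEFAULT_FIELDS):
    """Count the fields that are present in both records and equal."""
    count = 0
    for field in fields:
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue  # a missing field is neither a match nor a mismatch
        if _normalize(va) == _normalize(vb):
            count += 1
    return count


def probably_same_track(a, b, n=3, fields=DEFAULT_FIELDS):
    """Treat two records as the same track if more than n fields agree."""
    return matching_fields(a, b, fields) > n
```

A low `n` produces more false positives (for example, live and studio versions of the same song), so the threshold needs tuning against a sample of your own library.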

Rahul
The API is based on having semi-correct data in the first place... And, yeah, I think they'd ban your IP within the first 100 rapid-fire requests. Nice idea though.
Oli
I actually wrote a utility that uses the Last.fm API and it included a way to limit requests and cache data received.
Kevin Lamb
A: 

I think you could compare hash values of the file data itself: not the audio or tag content, just the file data. That would do it, I think.

Keng
A: 

How about something like this: find a library that gives you the mp3's length as well as a pointer to the audio data (it looks like there are a couple of libraries out there that can do this), do a first-pass filter based on song length, and for the songs that have matching lengths, checksum their audio data. This is similar to this script for finding duplicate files / images.
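A minimal sketch of that two-pass idea is below. It assumes you supply two callables yourself: `get_length(path)` returning the duration in seconds (a tag-reading library could provide this) and `get_audio_digest(path)` returning a checksum of the audio data. Both names are hypothetical placeholders, and rounding lengths to whole seconds is a simplification that can separate two files sitting just either side of a rounding boundary.

```python
from collections import defaultdict


def group_by(items, key):
    """Group items by key(item) and keep only groups with 2+ members."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return [group for group in groups.values() if len(group) > 1]


def find_duplicates(paths, get_length, get_audio_digest):
    """Two-pass duplicate search: cheap length filter, then audio checksum.

    get_length and get_audio_digest are caller-supplied (hypothetical)
    functions; the expensive digest is only computed for files whose
    rounded length collides with at least one other file.
    """
    duplicates = []
    for candidates in group_by(paths, lambda p: round(get_length(p))):
        duplicates.extend(group_by(candidates, get_audio_digest))
    return duplicates
```

The point of the first pass is that most files have a unique length, so the checksum, which has to read the whole file, runs on a small fraction of the collection.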

Parand
+2  A: 

Here's what I would do (or have done before)...

  1. Load all songs into iTunes (bear with me). (Note: if you can use iTunes here, then stop ... I assume your list of dupes is long and unmanageable.)
  2. Delete all songs, sending them to the trash can; this way you get rid of the directory structure.
  3. Obviously, don't "empty trash". Rescue the songs to a folder on your desktop.
  4. Use software like MediaMonkey, Dupe Eliminator or even iTunes itself to identify the duplicates. Dupe Eliminator is good in that it checks a varying number of factors (artist, length, file size and whatnot) and guesses what is a dupe and what isn't.
  5. Reload into iTunes, this time with "Auto arrange songs" checked, which will drop your new, dupeless list into a nice by-artist, by-album arrangement.

... voila! (or if you read digg: "...profit!")

/mp

mauriciopastrana
+3  A: 

The way I have gone about this in the past is to use genpuid, which comes from MusicIP. The closed-source software creates an audio fingerprint of a file regardless of format, ID3 tags, checksum, etc.

More information can be found here.

http://musicbrainz.org/doc/genpuid

This should find the largest number of true duplicate matches while minimizing false positives. It can also fix incorrect ID3 tags.

Kevin Lamb
This looks like a great idea, but it appears that there are a bunch of AAC-files in there as well, which genpuid does not support (on Linux, at least).
JesperE
A: 

I'm sure there are more elegant solutions out there, but if the audio data is equivalent, then stripping the ID3 tags and hashing should do the trick. After hashing, you can put the ID3 tags back if you like.
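As a rough sketch, the tags do not have to be physically removed: you can skip over them while hashing. The code below assumes only the two standard ID3 layouts (an ID3v2 block at the start of the file and a 128-byte ID3v1 block at the end); other tag formats such as APEv2, and the Vorbis comments embedded in ogg files, are not handled. The function names are illustrative.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag or a trailing ID3v1 tag."""
    start, end = 0, len(data)

    # ID3v2: 10-byte header "ID3", version (2), flags (1), size (4).
    # The size is "synchsafe": only the low 7 bits of each byte count.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:  # footer-present flag adds another 10 bytes
            start += 10
        start = min(start, end)  # guard against a corrupt size field

    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return data[start:end]


def digest_bytes(data: bytes) -> str:
    """SHA-1 of the file contents with ID3 tags skipped."""
    return hashlib.sha1(strip_id3(data)).hexdigest()


def digest_file(path: str) -> str:
    """Read a whole file into memory and hash its tag-free contents."""
    with open(path, "rb") as handle:
        return digest_bytes(handle.read())
```

Two files with the same digest have byte-identical audio. As noted elsewhere in this thread, a different rip or a re-encode of the same song will still produce a different digest.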

Mark Brackett