I'm trying to write a Python script that finds duplicate MP3/MP4 files by comparing the audio data itself. I have many MP3/MP4 files with similar file names but different ID3 tags. At first I tried looping through the files and comparing their MD5 hashes (ignoring file names). This, of course, didn't work when the ID3 tags didn't match.

As a result, I'm looking for a way to extract only the music data from an mp3/4 in order to run it through md5 and find any duplicates. What is the best way to go about this?

A: 

That's actually pretty advanced, fuzzy logic-type stuff you're asking about.

This isn't an answer, but take a look at the discussion in this question: http://stackoverflow.com/questions/476227/detect-duplicate-mp3-files-with-different-bitrates-and-or-different-id3-tags (It might actually qualify as a dupe... It's even Python-specific.)

Paul Sasik
Completely different problem. These files are copies of the same MP3 with different ID3 tags, since iTunes tries to be smart and update the ID3 tags. There should be no binary difference in the music, only in the metadata. Thanks for answering, though. =-]
Jack M.
+2  A: 

Try using id3-py or mutagen to strip out all the tags (both ID3v1 and ID3v2, which can both be present in the same file), then compute the MD5 of the result.

Assuming iTunes didn't manipulate the files beyond the tags, the stripped results should be identical. Transcoding would obviously make this approach invalid.
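
As a rough illustration of the idea (a sketch only, not the code anyone in this thread actually used, and it does not use id3-py or mutagen): instead of rewriting each file, you can skip the tag bytes while reading and hash only what lies between them. The helper names `synchsafe_to_int`, `audio_md5`, and `find_duplicates` are made up for this example. It assumes plain MP3 files with at most one ID3v2 tag at the start and one ID3v1 tag in the last 128 bytes; it does not handle MP4 containers, APE tags, or other metadata, and a file whose audio happens to have "TAG" exactly 128 bytes from the end would be misread.

```python
import hashlib
import os
from collections import defaultdict


def synchsafe_to_int(data):
    """Decode a synchsafe integer (7 useful bits per byte), as used by ID3v2."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
    return value


def audio_md5(path, chunk_size=1 << 16):
    """Return the MD5 of a file's bytes, skipping a leading ID3v2 tag
    and a trailing ID3v1 tag if they are present."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        start = 0
        header = f.read(10)
        if len(header) == 10 and header[:3] == b"ID3":
            # 10-byte header plus the tag body; bytes 6-9 hold the body size.
            start = 10 + synchsafe_to_int(header[6:10])
            if header[5] & 0x10:  # flag bit: a 10-byte footer follows the tag
                start += 10

        end = size
        if size - start >= 128:
            f.seek(size - 128)
            if f.read(3) == b"TAG":  # ID3v1 occupies the last 128 bytes
                end = size - 128

        # Hash only the region between the tags, in chunks.
        digest = hashlib.md5()
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group .mp3 files under root by audio hash; return groups with 2+ files."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".mp3"):
                path = os.path.join(dirpath, name)
                groups[audio_md5(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Because this only reads the files, it avoids writing stripped copies to disk. Something like `for group in find_duplicates("/path/to/music"): print(group)` would list each set of suspected duplicates.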

Nick T
While quite disk intensive, stripping out the tags with Mutagen worked out pretty darn well.
Jack M.