I want to programmatically create a SHA1 checksum of audio files (MP3, Ogg Vorbis, FLAC). The requirement is that the checksum stays stable even if the metadata header (e.g. ID3) changes.
Note: The audio files don't have CRCs.

This is what I have tried so far:

1) Reading and hashing all MPEG frames using Perl and MPEG::Audio::Frame

use Digest::SHA1;
use MPEG::Audio::Frame;

# FH is a filehandle already opened on the MP3 file
my $sha1 = Digest::SHA1->new;
while (my $frame = MPEG::Audio::Frame->read(\*FH)) {
    $sha1->add($frame->content());
}

2) Decoding and hashing all MPEG frames using Python and libmad (pymad)

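import hashlib
import mad  # pymad, the Python bindings for libmad

mf = mad.MadFile(path)
sha1 = hashlib.sha1()

# Feed every decoded buffer into the hash until the stream ends
while True:
    buf = mf.read()
    if buf is None:
        break
    sha1.update(buf)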

3) Using mp3cat

> mp3cat - - < file.mp3 | sha1sum

However, none of these methods produced a stable checksum: in some cases the checksum changed after retagging the file with Picard.

Are there any libraries that already provide what I want?
I don't care about the programming language…

Update: I debugged the case a bit further. The libmad checksum inconsistency seems to occur when libmad reports decoding errors such as "Huffman data overrun (0x0238)". As this happens on many of my MP3 files, I'm not sure whether it really indicates a broken file…

A: 

Bene, if I were you (and I am working on something very similar to what you want to do), I would hash the MP3 data block. Extract it to raw data first and write it out to disk, so you know exactly what you are dealing with. Then modify the ID3 tag and hash the data again. If the hash changes, compare the two sets of raw data and find out where they differ. Chances are you are overstepping a boundary somewhere. If I recall correctly, each MP3 frame (though not necessarily the file itself) starts with a sync pattern of 11 set bits, so its first byte is FF and the top three bits of the second byte are set (FF FB is common for MPEG-1 Layer III).
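To make the "extract, hash, retag, compare" idea concrete, here is a minimal Python sketch. It assumes the only metadata present is a leading ID3v2 tag and/or a trailing 128-byte ID3v1 tag; APEv2 or Lyrics3 tags and a rewritten Xing/Info frame are not handled, so treat it as a starting point for debugging rather than a finished solution. The function names `strip_id3`, `audio_sha1` and `first_difference` are made up for this example.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag and a trailing ID3v1 tag."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, then a 4-byte
    # "syncsafe" size (7 bits per byte) that excludes the 10-byte header.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        start = 10 + size
        if data[5] & 0x10:  # ID3v2.4 footer flag: 10 more bytes
            start += 10
    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def audio_sha1(data: bytes) -> str:
    """SHA1 hex digest of everything between the ID3 tags."""
    return hashlib.sha1(strip_id3(data)).hexdigest()


def first_difference(a: bytes, b: bytes) -> int:
    """Offset of the first differing byte, or -1 if a and b are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1 if len(a) == len(b) else min(len(a), len(b))
```

Read the file before and after retagging with `open(path, "rb").read()`, compare `audio_sha1` of both, and if they differ, call `first_difference` on the two `strip_id3` results to see at which offset the tagger changed something.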

I'm interested in your findings, as I'm still writing the code that deals with the fingerprints and haven't gotten to the actual hashing yet.

LarryF
A: 

I'm trying to do the same thing, using MD5 instead of SHA1. I started by exporting audio checksums with Mp3tag (www.mp3tag.de/en/), then wrote a Perl script similar to yours to do the same thing. When I removed all tags from my test file, the audio checksum remained the same.

This is the script:

use MPEG::Audio::Frame;
use Digest::MD5 qw(md5_hex);
use strict;

my $file = 'E:\Music\MP3\Russensoul\01 - 5nizza , Soldat (Russensoul - Russensoul).mp3';
my $mp3tag_audio_md5 = lc '2EDFBD62995A46A45CEEC08C1F303486';

my $md5 = Digest::MD5->new;

open(FILE, $file) or die "Cannot open $file : $!\n";
binmode FILE;

while(my $frame = MPEG::Audio::Frame->read(\*FILE)){
    $md5->add($frame->asbin);
}

print '$md5->hexdigest  : ', $md5->hexdigest, "\n",
      'mp3tag_audio_md5 : ', $mp3tag_audio_md5,  "\n",
      ;

Is it possible that whatever you use to modify your tags sometimes also modifies mp3 headers?
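One way to check this is to decode the 4-byte header of the first frame before and after retagging and compare the fields. The following Python sketch does only the decoding; it assumes you have already located a frame start (for example by skipping the ID3v2 tag), and `parse_frame_header` is a name invented for this example. Bitrate and sample rate are left as raw table indices, since comparing them is enough to spot a change.

```python
def parse_frame_header(header: bytes) -> dict:
    """Decode the fixed fields of a 4-byte MPEG audio frame header."""
    if len(header) != 4:
        raise ValueError("expected exactly 4 bytes")
    b0, b1, b2, b3 = header
    # Sync word: 11 set bits (0xFF, then the top 3 bits of the next byte).
    if b0 != 0xFF or (b1 & 0xE0) != 0xE0:
        raise ValueError("no frame sync at this position")
    versions = {0: "MPEG-2.5", 1: "reserved", 2: "MPEG-2", 3: "MPEG-1"}
    layers = {0: "reserved", 1: "Layer III", 2: "Layer II", 3: "Layer I"}
    return {
        "version": versions[(b1 >> 3) & 0x03],
        "layer": layers[(b1 >> 1) & 0x03],
        "has_crc": (b1 & 0x01) == 0,  # protection bit: 0 means a CRC follows
        "bitrate_index": b2 >> 4,
        "sample_rate_index": (b2 >> 2) & 0x03,
        "padding": bool((b2 >> 1) & 0x01),
        "channel_mode": b3 >> 6,
    }
```

For example, `parse_frame_header(b"\xff\xfb\x90\x00")` describes an MPEG-1 Layer III frame without CRC. If the dictionaries for the same file differ before and after tagging, the tagger touched more than the tags.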

mivk
A: 

If you are looking for stable hashes of the actual music, you might want to look at libOFA. Your current methods can give different results because the formats can carry embedded tags. Also, if you want two different files of the same song to return the same hash, you need to account for differences such as bitrate and sample rate.

libOFA, on the other hand, can give you a stable hash that works across formats and encodings. Might that be what you want?

Tobias R