I would like to read the last 1 megabyte of an MP3 file and calculate a SHA1 checksum for just that part of the file. The reason is that when I'm looking for duplicate MP3s, the header info (song title, album etc.) can differ even though it's the exact same audio file, so I figured I would be better off checksumming a part of the file at the end instead of the whole file. Is there an efficient way of doing this?

+2  A: 

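You'd have to use PHP's wrappers around the C file functions: fopen, fseek and fread. Note that fseek needs SEEK_END for a negative offset to count back from the end of the file:

$size = 1024 * 1024; // 1 MiB
$handle = fopen($file, 'rb'); // 'b' keeps the read binary-safe on all platforms
fseek($handle, -$size, SEEK_END); // offset is relative to the end of the file
$limitedContent = fread($handle, $size);
fclose($handle);
$hash = sha1($limitedContent); // SHA1, as asked in the question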
soulmerge
Thank you so much!
Johan
Warning: do not forget proper error handling!
soulmerge
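
As a rough illustration of the error-handling warning above, here is a minimal sketch that wraps the seek-and-hash steps in one function. The name hashFileTail() and its default size are assumptions made for this example, not PHP built-ins; files shorter than the requested size are simply hashed whole.

```php
<?php
/**
 * Return the SHA-1 of the last $size bytes of $file, or null on failure.
 * hashFileTail() is a hypothetical helper name chosen for this sketch.
 */
function hashFileTail(string $file, int $size = 1048576): ?string
{
    $handle = @fopen($file, 'rb');
    if ($handle === false) {
        return null; // missing or unreadable file
    }

    $stat = fstat($handle);
    if ($stat === false) {
        fclose($handle);
        return null;
    }

    // Never seek back further than the file is long.
    $length = max(0, min($size, $stat['size']));
    if ($length > 0 && fseek($handle, -$length, SEEK_END) !== 0) {
        fclose($handle);
        return null; // fseek returns 0 on success, -1 on failure
    }

    // Read everything from the current position to the end of the file.
    $data = $length > 0 ? stream_get_contents($handle) : '';
    fclose($handle);

    return $data === false ? null : sha1($data);
}
```

Calling hashFileTail($path) on two files and comparing the results would then flag candidates for a closer look; a null result means the file could not be read.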
+2  A: 

Try fseek with SEEK_END. This will move the file pointer to 1024 KiB (1 MiB) before the end of the file.

 fseek($fp, -1024 * 1024, SEEK_END);
St. John Johnson
+4  A: 

MP3s don't have any inherent "header" info for song/album/artist. That's handled by ID3 tags, which can sit either at the front of the file (ID3v2, variable size, depending on how much information has been specified) or at the end (ID3v1, fixed 128 bytes). To properly identify an MP3 by checksumming, you'd have to make sure that both versions of the ID3 tag are ignored. Furthermore, it's possible to have MP3 audio embedded in a .wav container, in which case there are .wav headers and whatnot.

And of course, there's always the case of having two songs encoded with different bitrates, sampling rates, and even different CD rippers and encoders. All will produce utterly different files, but are still "the same song".
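
For the tag-skipping part only (no byte checksum can help with the re-encoding case), a minimal sketch might look like the following. It assumes the standard layouts: an ID3v2 tag starts with "ID3" and stores its length as a 4-byte synchsafe integer in a 10-byte header, and an ID3v1 tag is the last 128 bytes starting with "TAG". The name hashMp3Audio() is hypothetical, the whole file is read into memory for simplicity, and other tag formats or .wav containers are not handled.

```php
<?php
/**
 * SHA-1 of the bytes between an optional leading ID3v2 tag and an
 * optional trailing ID3v1 tag. hashMp3Audio() is a hypothetical helper.
 */
function hashMp3Audio(string $file): ?string
{
    $data = @file_get_contents($file);
    if ($data === false) {
        return null; // missing or unreadable file
    }

    $start = 0;
    $end = strlen($data);

    // ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4 size bytes.
    // The size is synchsafe (7 bits per byte) and excludes the header.
    if ($end >= 10 && substr($data, 0, 3) === 'ID3') {
        $b = unpack('C4', substr($data, 6, 4)); // keys 1..4
        $tagSize = (($b[1] & 0x7F) << 21) | (($b[2] & 0x7F) << 14)
                 | (($b[3] & 0x7F) << 7) | ($b[4] & 0x7F);
        // Flag bit 0x10 signals an extra 10-byte footer (ID3v2.4).
        $hasFooter = (ord($data[5]) & 0x10) !== 0;
        $start = min($end, 10 + $tagSize + ($hasFooter ? 10 : 0));
    }

    // ID3v1: fixed 128 bytes at the very end, starting with "TAG".
    if ($end - $start >= 128 && substr($data, $end - 128, 3) === 'TAG') {
        $end -= 128;
    }

    return sha1(substr($data, $start, $end - $start));
}
```

Two copies of the same encoded audio with different tags should then produce the same hash, while files from different encoders or bitrates will still differ.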

Marc B
Very interesting, thanks for the detailed information.
Johan