ansaurus

Question

How can I tell if two image files are the same in Perl?

Answer 1

+1 A:

MD5 would work, but you'd still have to pull the file. Is there any useful metadata in the HTTP headers: Content-Length, Cache-Control directives, ETags, etc.?
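For reference, a minimal sketch of the MD5 comparison, assuming both images are already on disk. It uses the core Digest::MD5 module; the helper names `file_md5` and `same_file` are made up for illustration.

```perl
use strict;
use warnings;
use Digest::MD5;

# Return the hex MD5 digest of a file's bytes.
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;    # images are binary; avoid any newline translation
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Treat two files as the same image when their digests match.
sub same_file {
    my ($path_a, $path_b) = @_;
    return file_md5($path_a) eq file_md5($path_b) ? 1 : 0;
}
```

A matching digest only says the bytes are identical; the same picture saved in a different format or re-encoded would not match.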

cms 2009-08-28 18:29:18

Unfortunately not. Pulling the file isn't really an issue though, just don't want to be filling up my HD with dups.

Morinar 2009-08-28 18:31:41

Shame. I'd have thought you could just read the first n KB and compare, if you needed something more optimised than hashing the entire file. You'd probably have to experiment to find a decent n value.

cms 2009-08-28 18:41:24

Now that I look at these more closely, I DO have etags and content-length. I *think* in every instance.

Morinar 2009-08-28 18:58:48

Scratch that... I found at least one that doesn't have an etag.

Morinar 2009-08-28 19:00:02

You could order the tests, then: different size, different ETag (if present), first chunk, then hash.
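A rough sketch of that ordering for two files already on disk, cheapest test first. The ETag step is left out because it belongs to the HTTP side; the function names and the 8192-byte default chunk size are assumptions, not tuned values.

```perl
use strict;
use warnings;
use Digest::MD5;

# Read up to $length bytes from the start of a file.
sub first_chunk {
    my ($path, $length) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $buffer = '';
    defined read($fh, $buffer, $length) or die "Cannot read $path: $!";
    close $fh;
    return $buffer;
}

# Hex MD5 digest of the whole file.
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Cheapest checks first: size, then first chunk, then full hash.
sub same_image_file {
    my ($path_a, $path_b, $chunk_size) = @_;
    $chunk_size = 8192 unless defined $chunk_size;

    my $size_a = (stat $path_a)[7];
    my $size_b = (stat $path_b)[7];
    die "Cannot stat input files\n" unless defined $size_a && defined $size_b;

    return 0 if $size_a != $size_b;
    return 0 if first_chunk($path_a, $chunk_size) ne first_chunk($path_b, $chunk_size);
    return file_md5($path_a) eq file_md5($path_b) ? 1 : 0;
}
```

Most non-duplicates should fall out at the size check, so the full hash only runs for files that already look alike.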

cms 2009-08-28 19:26:13

Answer 2

+1 A:

Yep, that sounds right. Depending on how you're getting the file and how frequently, you might also be able to check for HTTP 304 Not Modified and save yourself the download.

2009-08-28 18:32:07

Answer 3

+5 A:

Don't open and hash the stored image each time; stash the hash alongside the image when you store it. Compare sizes as well.
Don't issue a GET request straight away; do a HEAD first and compare the size, last modification date and any ETags to what you got last time.
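A sketch of the HEAD-first check, assuming HTTP::Tiny is available (it ships with newer Perls; LWP::UserAgent would work just as well). The names `head_metadata` and `metadata_changed` and the record layout are made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Issue a HEAD request and keep only the headers worth comparing.
sub head_metadata {
    my ($url) = @_;
    my $response = HTTP::Tiny->new->head($url);
    die "HEAD $url failed: $response->{status} $response->{reason}\n"
        unless $response->{success};
    my $headers = $response->{headers};    # HTTP::Tiny lower-cases header names
    return {
        size     => $headers->{'content-length'},
        modified => $headers->{'last-modified'},
        etag     => $headers->{'etag'},
    };
}

# True when the image should be fetched: no previous record, a differing
# header, or no header present on both sides to compare.
sub metadata_changed {
    my ($old, $new) = @_;
    return 1 unless $old;
    my $compared = 0;
    for my $key (qw(size modified etag)) {
        next unless defined $old->{$key} && defined $new->{$key};
        $compared++;
        return 1 if $old->{$key} ne $new->{$key};
    }
    return $compared ? 0 : 1;
}
```

Headers missing from either record are skipped rather than treated as a mismatch, since not every server sends an ETag.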

moonshadow 2009-08-28 18:32:32

Haven't implemented this yet, but the more I play with it, the more I realize that this is the correct solution. I'm going to store the last run's header information and then compare to this run's info to determine whether or not to fetch. Thanks for the help all.
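One way to keep the last run's header information between runs is a small on-disk record keyed by URL. This is only a sketch using the core Storable module; the file name, the record layout and the helper names `load_seen` and `save_seen` are all assumptions.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Load the header records saved by the previous run, or start empty.
sub load_seen {
    my ($file) = @_;
    return -e $file ? retrieve($file) : {};
}

# Write the header records back to disk for the next run.
sub save_seen {
    my ($file, $seen) = @_;
    store($seen, $file);
    return;
}
```

Usage would look something like this, with a placeholder URL:

```perl
my $seen = load_seen('seen_headers.db');
$seen->{'http://example.com/a.jpg'} = { etag => '"abc"', size => 1024 };
save_seen('seen_headers.db', $seen);
```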

Morinar 2009-08-28 19:18:34

Answer 4

+3 A:

There are a number of HTTP headers you can use for this. If you save the time that you last retrieved the file, you can do a conditional GET with

If-Modified-Since: <date>

Or, if the server returns an ETag header with the response, you can store that with the image (or a collection of all of the ETags you have seen for that image), and do:

If-None-Match: <all of your etags here>

If the server supports conditional GETs and the image has not changed, you will get a "304 Not Modified" response, with no body.
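A sketch of such a conditional GET, assuming HTTP::Tiny (LWP::UserAgent would do the same job). The helper names and the `last_modified` / `etags` arguments are hypothetical; the date is expected to be the Last-Modified string exactly as the server sent it, and the ETags are expected to keep their original quotes.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Build the conditional request headers from whatever was stored earlier.
sub conditional_headers {
    my (%known) = @_;
    my %headers;
    $headers{'If-Modified-Since'} = $known{last_modified}
        if defined $known{last_modified};
    $headers{'If-None-Match'} = join ', ', @{ $known{etags} }
        if $known{etags} && @{ $known{etags} };
    return \%headers;
}

# Return the image bytes, or nothing when the server answers 304 Not Modified.
sub fetch_if_changed {
    my ($url, %known) = @_;
    my $response = HTTP::Tiny->new->get($url, { headers => conditional_headers(%known) });
    return if $response->{status} == 304;
    die "GET $url failed: $response->{status} $response->{reason}\n"
        unless $response->{success};
    return $response->{content};
}
```

A server that ignores the conditional headers simply returns 200 with the full body, so the caller still needs a local comparison as a fallback.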

Ian Clelland 2009-08-28 18:33:15

Answer 5

A:

There's also a nice fdupes tool for the purpose. Don't know what system you're using and what systems the tool can be built for.

Michael Krelin - hacker 2009-08-28 18:38:04
