I have a Perl script I wrote for my own personal use that periodically fetches image files from a website and saves them to a folder. These image files are quite often the same from fetch to fetch, and I'd like to avoid saving duplicates if I can.

My question: What would be the best way to compare/check if they are the same?

My only real thought so far is to open a file handle to the existing one, md5 it, md5 the $response->content from the fetch, and then compare them. Would that work?
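Something like this is what I had in mind (a rough sketch using Digest::MD5 and LWP::UserAgent; the URL and filename are placeholders):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);

my $url  = 'http://example.com/image.jpg';   # placeholder
my $file = 'image.jpg';                      # placeholder

my $response = LWP::UserAgent->new->get($url);
die $response->status_line unless $response->is_success;

# Hash the bytes we just fetched
my $new_md5 = md5_hex($response->content);

# Hash the copy already on disk, if there is one
my $old_md5 = '';
if (open my $fh, '<:raw', $file) {
    local $/;                        # slurp the whole file
    $old_md5 = md5_hex(scalar <$fh>);
    close $fh;
}

# Save only when the content actually changed
if ($new_md5 ne $old_md5) {
    open my $out, '>:raw', $file or die "Can't write $file: $!";
    print {$out} $response->content;
    close $out;
}
```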

Is there a better way?

EDIT:

Wow, tons of great suggestions already. Does it help if I tell you that this script runs daily via cron? I.e. it is guaranteed to always run at the exact same time every day. Also: I'm looking at the Last-Modified headers on some of these, and they don't look 100% accurate, i.e. there are some with a Last-Modified of over a week ago when I know the image is more recent than that. I'm assuming that's because the image file itself hasn't been modified on the server since then... which doesn't help me much...

+1  A: 

md5 would work, but you'd still have to pull the file. Is there any useful metadata in the HTTP headers: Content-Length, Cache-Control directives, ETags, etc.?

cms
Unfortunately not. Pulling the file isn't really an issue though, just don't want to be filling up my HD with dups.
Morinar
Shame. I'd have thought you could just read the first n KB and compare, if you needed something more optimised than hashing the entire file. You'd probably have to experiment to find a decent n value.
cms
Now that I look at these more closely, I DO have etags and content-length. I *think* in every instance.
Morinar
Scratch that... I found at least one that doesn't have an etag.
Morinar
You could order the tests, then: different size, different ETag (if present), first chunk, then full hash.
cms
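Sketched out, that ordering might look like the following (the %stored hash holding last run's values is hypothetical):

```perl
# Cheapest checks first; only fall back to hashing as a last resort.
# %stored holds values saved from the previous run (hypothetical).
sub probably_unchanged {
    my ($head, %stored) = @_;    # $head is an HTTP::Response from a HEAD request

    my $size = $head->header('Content-Length');
    return 0 if defined $size && $size != $stored{size};

    my $etag = $head->header('ETag');
    return 0 if defined $etag && defined $stored{etag}
             && $etag ne $stored{etag};

    return 1;    # headers agree; add first-chunk or full-hash checks if needed
}
```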
+1  A: 

Yep, that sounds right. Depending on how you're getting the file and how frequently, you might also be able to check for HTTP 304 Not Modified and save yourself the download.

+5  A: 
  • Don't open and hash the stored image each time; stash the hash alongside the image when you store it. Compare sizes as well.

  • Don't issue a GET request straight away; do a HEAD first and compare the size, last-modification date, and any ETags to what you got last time.
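A rough sketch of the HEAD-first check (the stashed %last values shown are placeholders):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua   = LWP::UserAgent->new;
my $head = $ua->head('http://example.com/image.jpg');   # headers only, no body

# Fingerprint of what the server is offering right now
my %now = (
    size  => $head->header('Content-Length') // '',
    mtime => $head->header('Last-Modified')  // '',
    etag  => $head->header('ETag')           // '',
);

# %last was stashed alongside the image on the previous run (placeholder values)
my %last = ( size => '12345', mtime => '', etag => '"abc123"' );

if ( $now{size}  ne $last{size}
  || $now{mtime} ne $last{mtime}
  || $now{etag}  ne $last{etag} ) {
    # Something changed: issue the full GET and save the image
}
```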

moonshadow
Haven't implemented this yet, but the more I play with it, the more I realize that this is the correct solution. I'm going to store the last run's header information and then compare it to this run's info to determine whether or not to fetch. Thanks for the help, all.
Morinar
+3  A: 

There are a number of HTTP headers you can use for this -- if you save the time that you last retrieved the file, you can do a conditional get with

If-Modified-Since: <date>

Or, if the server returns an ETag header with the response, you can store that with the image (or a collection of all of the ETags you have seen for that image) and do:

If-None-Match: <all of your etags here>

If the server supports conditional gets, then you will get a "304 Not Modified" response, with no body.
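In LWP this is just a matter of passing the headers to get() (the saved values below are placeholders):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $url = 'http://example.com/image.jpg';

# Values recorded when the image was last saved (placeholders)
my $last_fetch = 'Sat, 29 Oct 1994 19:43:31 GMT';
my $last_etag  = '"abc123"';

my $res = $ua->get($url,
    'If-Modified-Since' => $last_fetch,
    'If-None-Match'     => $last_etag,
);

if ($res->code == 304) {
    # Not modified: nothing to download, nothing to save
}
elsif ($res->is_success) {
    # Fresh content: save it, and record the new validators for next time
    my $new_etag = $res->header('ETag');
    my $new_date = $res->header('Last-Modified');
}
```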

Ian Clelland
A: 

There's also the nice fdupes tool for this purpose. I don't know what system you're using or which systems the tool can be built for.
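For example, run over your folder of saved images (directory path is a placeholder):

```shell
# List duplicate files under ~/images, recursing into subdirectories
fdupes -r ~/images

# Delete the duplicates without prompting, keeping the first copy found
fdupes -rdN ~/images
```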

Michael Krelin - hacker