views: 409
answers: 3

This is a bit of a stretch, but I have an interesting (to me) programming (err... scripting? algorithmic? organizational?) problem. (I am tagging this in Ruby, because of my preference for Ruby for scripting.)

Imagine you have 100 gigabytes of pictures floating around on multiple drives. There are likely a total of 25 gigabytes of unique pictures. The rest are either duplicates (with the same filename), duplicates (with a different name), or smaller versions of the picture (exported for email). Of course, aside from these being on multiple drives, they also are in different folder structures. For instance, img_0123.jpg might exist (in the Windows world) as c:\users\username\pics\2008\img_0123.jpg, c:\pics\2008\img_0123.jpg, c:\pics\export\img_0123-email.jpg, and d:\pics\europe_2008\venice\bungy_jumping_off_st_marks.jpg.

Back in the day we had to put everything in folders and give files pretty little names (like the above). Today, search and tagging take care of all of that, so the elaborate folder structure is redundant (and the duplication makes everything harder to organize).

In the past, I tried moving everything to one drive, wrote a Ruby script to scan for duplicates (I don't trust those dupe-finder programs - I ran one, and it started deleting everything!), and tried reorganizing them. After a few days, though, I gave up (on the organizing and manual deleting, at least).

I am about to try a new approach. First, copy all the pictures from all of my drives onto a new drive, into ONE folder. Anything with a duplicate file name will need to be checked manually. Then fire up Picasa, manually scan through the files, and delete duplicates myself (using the good ol' noggin).
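
For concreteness, a rough sketch of that copy step might look something like this (untested, the paths are made up, and name collisions just get a numeric suffix so nothing is overwritten - those renamed files are the ones to check by hand):

    # Rough sketch: flatten pictures from several source folders into one destination,
    # renaming on filename collisions instead of overwriting.
    require 'fileutils'

    sources = ['C:/users/username/pics', 'C:/pics', 'D:/pics']  # made-up locations
    dest    = 'E:/all_pics'                                     # made-up new drive
    FileUtils.mkdir_p(dest)

    sources.each do |src|
      Dir.glob(File.join(src, '**', '*.{jpg,JPG,jpeg,JPEG}')).each do |path|
        target = File.join(dest, File.basename(path))
        if File.exist?(target)
          # Name clash: keep trying _dup1, _dup2, ... until we find a free name.
          base = File.basename(path, '.*')
          ext  = File.extname(path)
          n    = 1
          n += 1 while File.exist?(target = File.join(dest, "#{base}_dup#{n}#{ext}"))
        end
        FileUtils.cp(path, target)
      end
    end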

However, I am very dissatisfied that I couldn't easily solve this programmatically, and I am interested in hearing other solutions, programmatic or otherwise (maybe writing code isn't the best solution, gasp!).

+2  A: 

Have you considered taking an MD5 checksum of each file and detecting duplicates that way? If you did that, you wouldn't have to resolve exact duplicates manually.

I would checksum each file and check it against a dictionary of already-processed files. If it turns up as a duplicate, I would shoot it off to a duplicates directory rather than delete it entirely.
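
Since the question is tagged Ruby, a minimal sketch of that idea might look like this (the directories are assumed, and exact duplicates get moved aside rather than deleted):

    # Minimal sketch: hash every file and move byte-identical duplicates aside for review.
    require 'digest/md5'
    require 'fileutils'

    photos_dir = 'E:/all_pics'        # assumed
    dupes_dir  = 'E:/all_pics_dupes'  # assumed
    FileUtils.mkdir_p(dupes_dir)

    seen = {}  # checksum => first path seen with that content

    Dir.glob(File.join(photos_dir, '**', '*')).select { |p| File.file?(p) }.each do |path|
      sum = Digest::MD5.file(path).hexdigest
      if seen[sum]
        # Same bytes as an earlier file: park it in the duplicates folder instead of deleting it.
        FileUtils.mv(path, File.join(dupes_dir, File.basename(path)))
      else
        seen[sum] = path
      end
    end

Note that this only catches byte-identical copies - the resized email exports will slip through - and a name clash inside the duplicates folder would still need handling.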

Simucal
Yes, that is indeed how I implemented the Ruby script I used to detect duplicates. However, I am hoping to gain some insight into the problem at a higher level. What I am trying to solve seems to me more of an architectural issue of managing and organizing.
+5  A: 

I like my photos to be sorted by date taken, so I wrote a Groovy script that looks at the EXIF data of each picture and puts it into a directory named in ISO date format (2008-12-11). That keeps them organised. It doesn't solve tagging by content, though; I use Flickr for that.

As for the duplication problem, a checksum would cut down on the number of images you'd have to sort manually, but unfortunately it wouldn't pick up the resized images. You could look for a less crappy dupe finder - one that doesn't automatically delete duplicates? Be sure to make a backup before you test any, though :p
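
For reference, roughly the same date-sorting idea in Ruby might look like this (just a sketch, not the Groovy script mentioned above; it assumes the exifr gem and made-up paths):

    # Sketch: file photos into YYYY-MM-DD folders based on their EXIF capture date.
    # Assumes the 'exifr' gem (gem install exifr); paths are made up.
    require 'exifr/jpeg'
    require 'fileutils'

    src  = 'E:/all_pics'  # assumed
    dest = 'E:/by_date'   # assumed

    Dir.glob(File.join(src, '**', '*.{jpg,JPG,jpeg,JPEG}')).each do |path|
      taken = EXIFR::JPEG.new(path).date_time_original rescue nil
      next unless taken  # skip files with no usable EXIF date
      day_dir = File.join(dest, taken.strftime('%Y-%m-%d'))
      FileUtils.mkdir_p(day_dir)
      FileUtils.cp(path, File.join(day_dir, File.basename(path)))
    end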

Kenny
Could you share your Groovy script?
I'm at work at the moment, but sure - if I remember, I'll post it when I get home. It's not extensively tested but has worked for me so far.
Kenny
Kenny, have you considered posting your Groovy script online? I'd like to do exactly what you've done.
Nathan DeWitt
+1  A: 

You could use something like ExifTool, which does exist for Windows, to reorganize your pictures according to the capture time (which is my own scheme) or any other EXIF parameter found inside a JPEG or RAW file. You'll be able to find duplicates very easily.
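
For example, ExifTool can move files into date-based folders in a single pass by writing its Directory tag from another tag (here DateTimeOriginal stands in for the "capture time" above, and the paths are made up; it is driven from Ruby to match the question's tag, but works just as well straight from the command line):

    # Sketch: have exiftool (assumed to be installed and on the PATH) move pictures
    # into YYYY/MM-DD folders based on their DateTimeOriginal tag, recursively.
    src = 'E:/all_pics'  # made-up source directory
    system('exiftool', '-r', '-d', '%Y/%m-%d', '-Directory<DateTimeOriginal', src)

Once everything is grouped by capture date, duplicates tend to land side by side, which makes them much easier to spot.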

Keltia