views: 254

answers: 3

Hello,

I have a simple, real-life problem I want to solve using an OO approach. My hard drive is a mess. I have 1,500,000 files, duplicates, completely duplicated folders, and so on...

The first step, of course, is parsing all the files into my database. No problems so far; now I have a lot of nice entries which are "naturally grouped" in a sense. Examples of this simple grouping can be obtained using simple queries like:

  1. Give me all files bigger than 100MB
  2. Show all files older than 3 days
  3. Get me all files ending with docx
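The three example queries above can be sketched as one small filter function. This is a minimal sketch, not the asker's actual database code; the parameter names (`min_size`, `min_age_days`, `suffix`) are made up for illustration:

```python
import os
import time

def find_files(root, min_size=None, min_age_days=None, suffix=None):
    """Walk `root` and yield paths that pass all of the given simple filters.

    min_size     -- keep only files strictly bigger than this many bytes
    min_age_days -- keep only files older (by mtime) than this many days
    suffix       -- keep only files whose name ends with this string
    """
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # unreadable entry; skip it
            if min_size is not None and st.st_size <= min_size:
                continue
            if min_age_days is not None and (now - st.st_mtime) <= min_age_days * 86400:
                continue
            if suffix is not None and not name.endswith(suffix):
                continue
            yield path
```

For example, `find_files("/data", min_size=100 * 2**20)` would answer query 1, and `find_files("/data", suffix=".docx")` query 3.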

But now assume I want to find groups with a little more natural meaning. There are different strategies for this, depending on the "use case".

Assume I have a bad habit of putting all my downloaded files on the desktop first. Then I extract them to the appropriate folder, often without deleting the ZIP file. Then I move them into an "attic" folder. For the system to find this group of files, a time-oriented search approach, perhaps combined with a "check whether the ZIP matches folder X" test, would be suitable.
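The time-oriented part of that search could start as simple clustering on modification times: files touched close together probably belong to the same download-and-extract session. A minimal sketch (the `gap_seconds` threshold is an assumption, not anything from the question):

```python
def time_clusters(paths_with_mtimes, gap_seconds=3600):
    """Group files whose modification times lie within `gap_seconds`
    of the previous file -- a crude session detector for the
    download/extract/attic habit described above.

    paths_with_mtimes -- iterable of (path, mtime_in_seconds) pairs
    Returns a list of clusters, each a list of paths.
    """
    ordered = sorted(paths_with_mtimes, key=lambda pm: pm[1])
    clusters, current, last = [], [], None
    for path, mtime in ordered:
        if last is not None and mtime - last > gap_seconds:
            clusters.append(current)  # gap too big: start a new session
            current = []
        current.append(path)
        last = mtime
    if current:
        clusters.append(current)
    return clusters
```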

Assume another bad habit of duplicating files: I have one folder where "the clean files" live in a nice structure, and other, messy folders. Now my clean folder has 20 picture galleries; my messy folder has 5 duplicated galleries and 1 new one. A human user could easily identify this logic: "Oh, those are all just duplicates, that's a new one, so I put the new one in the clean folder and trash all the duplicates."

So, now to get to the point:

Which combination of strategies or patterns would you use to tackle such a situation? If I chain filters, the "hardest" one wins, and I have no idea how to let the system "test" for suitable combinations. And it seems to me it is more than just filtering. It's dynamic grouping by combining multiple criteria to find the "best" groups.

One very rough approach would be this:

  1. In the beginning, all files are equal
  2. The first, not so "good" group is the directory
  3. If you are a big, clean directory, you earn points (evenly distributed names)
  4. If all files have the same creation date, you may be "autocreated"
  5. If you are a child of Program-Files, I don't care for you at all
  6. If I move you, group A, into group C, would this improve the "entropy"?
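The rough point-scoring approach above can be prototyped as a single heuristic function. This is only a sketch of rules 3-5; the weights, the first-letter entropy measure, and the `"Program Files"` substring check are all my own assumptions, not a worked-out scoring scheme:

```python
import math
import os
from collections import Counter

def name_entropy(names):
    """Shannon entropy over first letters -- a crude signal for
    'evenly distributed names' (rule 3)."""
    counts = Counter(n[:1].lower() for n in names if n)
    total = sum(counts.values())
    if not total:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def score_directory(path):
    """Toy scoring: each heuristic adds or removes points; higher = better group."""
    if "Program Files" in path:
        return float("-inf")  # rule 5: don't care at all
    try:
        entries = os.listdir(path)
    except OSError:
        return 0.0
    files = [e for e in entries if os.path.isfile(os.path.join(path, e))]
    score = 0.0
    score += len(files) * 0.1       # rule 3: big directories earn points
    score += name_entropy(files)    # rule 3: evenly distributed names
    mtimes = {int(os.path.getmtime(os.path.join(path, f))) for f in files}
    if len(files) > 1 and len(mtimes) == 1:
        score -= 1.0                # rule 4: same timestamp -> likely autocreated
    return score
```

Rule 6 would then become: tentatively move group A into group C, re-score both, and keep the move only if the total score improves; that's essentially a greedy local search over groupings.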

What are the best patterns fitting this situation? Strategy, Filters and Pipes, "Grouping"... Any comments welcome!

Edit in reaction to answers:

The tagging approach: Of course, tagging crossed my mind. But where do I draw the line? I could create different tag types, like InDirTag, CreatedOnDayXTag, TopicZTag, AuthorPTag. These tags could be structured in a hierarchy, but the question of how to group would remain. But I will give this some thought and add my insights here.
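One way to see how far tagging alone gets you: model a tag as a plain (type, value) pair, mirroring the hypothetical tag types named above, and invert the file-to-tags mapping. Every tag then induces a group for free; the open question of *which* groups are good remains. A minimal sketch:

```python
from collections import defaultdict

def group_by_tags(file_tags):
    """Invert a {path: set of (tag_type, value)} mapping so that every
    tag points at the set of files carrying it. Each tag is one
    candidate group; ranking the groups is the part tagging alone
    does not solve."""
    groups = defaultdict(set)
    for path, tags in file_tags.items():
        for tag in tags:
            groups[tag].add(path)
    return groups
```

Combining criteria then becomes set algebra on the groups, e.g. intersecting the `("InDir", "attic")` group with a `("CreatedOnDay", ...)` group.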

The procrastination comment: Yes, it sounds like that. But the files are only the simplest example I could come up with (and the most relevant at the moment). It's actually part of the bigger picture of grouping related data in dynamic ways. Perhaps I should have kept it more abstract to stress this: I am NOT searching for a file-tagging tool or a search engine, but for an algorithm or pattern to approach this problem... (or better, ideas, like tagging)

Chris

+2  A: 

I don't have a solution (and would love to see one), but I might suggest extracting metadata from your files besides the obvious name, size and timestamps.

  • in-band metadata such as MP3 ID3 tags, version information in EXEs / DLLs, HTML titles and keywords, Summary information in Office documents, etc. Even image files can have interesting metadata. A hash of the entire contents helps when looking for duplicates.
  • out-of-band metadata such as can be stored in NTFS alternate data streams - eg. what you can edit in the Summary tab for non-Office files
  • your browser keeps information about where you downloaded files from (though Opera doesn't keep it for long), if you can read it.
Hugh Allen
+4  A: 

You're procrastinating. Stop that, and clean up your mess. If it's really big, I recommend the following tactic:

  1. Make a copy of all the stuff on your drive on an external disk (USB or whatever)
  2. Do a clean install of your system
  3. As soon as you find you need something, get it from your copy, and place it in a well defined location
  4. After 6 months, throw away your external drive. Anything that's on there can't be that important.

You can also install Google Desktop, which does not clean your mess, but at least lets you search it efficiently.

If you want to prevent this from happening in the future, you have to change the way you're organizing things on your computer.

Hope this helps.

Rolf
Thanks mom ;) Just kidding - tough love is a good thing too!
David Robbins
+1  A: 

You've got a fever, and the only prescription is Tag Cloud! You're still going to have to clean things up, but with tools like TaggCloud or Tag2Find you can organize your files by meta data as opposed to location on the drive. Tag2Find will watch a share, and when anything is saved to the share a popup appears and asks you to tag the file.

You should get Google Desktop too.

David Robbins