What I want to create is a huge index over an even bigger collection of data. The data is a huge collection of images (and I mean millions of photos!) and I want to build an index of all unique images. So I calculate a hash value for every image and combine it with the width, height and file size of the image. Together these make a nearly unique key for every image. This key would be combined with the location of the image, or location**s** in case of duplicates.
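
For concreteness, here is a minimal sketch of the kind of record this describes; the field names and sizes are assumptions, not a fixed design:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch only: the hash plus the image dimensions and file size form the
// (near-)unique key; the paths record where the copy or copies live.
struct ImageKey {
    std::uint64_t hash;      // content hash of the image data
    std::uint32_t width;     // pixel width
    std::uint32_t height;    // pixel height
    std::uint64_t fileSize;  // size of the file in bytes
};

struct ImageEntry {
    ImageKey key;
    std::vector<std::string> locations;  // one path, or several for duplicates
};
```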

Technically speaking, this would fit perfectly in a single database table. A unique index on the file name, plus an additional non-unique index on hash-width-height-size, would be enough. However, I could either use an existing database system to solve this or just write my own, optimized version. It will be a single-user application anyway, and the main purpose is to detect when I add a duplicate image to the collection, so it can warn me that I already have it and show the locations of the other copies. I can then decide to add the duplicate anyway or to discard it.
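
Spelled out, that table and those two indexes might look roughly like this (shown here as a SQL string; the table and column names are only placeholders):

```cpp
// Hypothetical schema for the design described above: a unique index on the
// file name and a non-unique index on the hash/width/height/size key.
const char* const kSchemaSql =
    "CREATE TABLE images ("
    "  path   TEXT    NOT NULL,"
    "  hash   INTEGER NOT NULL,"
    "  width  INTEGER NOT NULL,"
    "  height INTEGER NOT NULL,"
    "  size   INTEGER NOT NULL"
    ");"
    "CREATE UNIQUE INDEX idx_images_path ON images(path);"
    "CREATE INDEX idx_images_key ON images(hash, width, height, size);";
```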

I've written hash-table implementations before and it's not that difficult once you know what you have to be aware of. So I could just implement my own file format for this data. It's unlikely that I'll ever need to add more information to these images, and I'm not interested in similar images, just exact duplicates. I'm not storing the original images in this file either, just the hash, size and location. From experience, I know this can run extremely fast. I've done it before and have been doing similar things for nearly three decades, so it's likely that I will choose this solution.
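
A minimal in-memory sketch of that idea (purely illustrative; a real version would persist this in its own on-disk format, and all names here are made up) could look like this:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Purely illustrative: the composite key is flattened into a single string
// ("hash-width-height-size") so plain hash maps can answer the two questions
// that matter: "is this file already indexed?" and "is this image a duplicate?"
class ImageIndex {
public:
    static std::string makeKey(std::uint64_t hash, std::uint32_t width,
                               std::uint32_t height, std::uint64_t fileSize) {
        return std::to_string(hash) + '-' + std::to_string(width) + '-' +
               std::to_string(height) + '-' + std::to_string(fileSize);
    }

    // Paths already stored under this key, or nullptr if the image is new.
    const std::vector<std::string>* findDuplicates(const std::string& key) const {
        auto it = byKey_.find(key);
        return it == byKey_.end() ? nullptr : &it->second;
    }

    bool containsPath(const std::string& path) const {
        return byPath_.count(path) != 0;
    }

    void add(const std::string& key, const std::string& path) {
        byKey_[key].push_back(path);
        byPath_.insert(path);
    }

private:
    std::unordered_map<std::string, std::vector<std::string>> byKey_;
    std::unordered_set<std::string> byPath_;
};
```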

But I do wonder... If I did the same with an existing database system like SQL Server, Oracle, InterBase or MySQL, would performance still be high enough? There would be about 750 TB of images indexed in this database, which roughly translates to around 30 million records in a single, small table. Is it even worth considering the use of a regular database?
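
For a sense of scale (using assumed record sizes, so treat the numbers as rough), the index itself stays tiny compared to the image data it describes:

```cpp
#include <cstdio>

int main() {
    // Assumed sizes; adjust to taste.
    const double records        = 30e6;           // ~30 million images
    const double bytesPerRecord = 16 + 12 + 200;  // hash + dims/size + avg path
    const double totalImageData = 750e12;         // 750 TB of image files

    std::printf("average image size: %.1f MB\n", totalImageData / records / 1e6);
    std::printf("raw index size:     %.1f GB\n", records * bytesPerRecord / 1e9);
    // Roughly 25 MB per image, and well under 10 GB of index data in total.
}
```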

I have doubts about the usability of a database for this project. The amount of data is huge, yet the structure is really simple. I don't need multi-user support or most of the other features that databases typically provide, so I don't see a need for one. But I'm interested in the opinions of other programmers about this. (Although I expect most will agree with me here.)


The project itself, which is still just an idea in my head, is supposed to be some tool or add-on for Explorer or whatever. Basically, it builds an index for any external hard disk that I attach to the system, and when I copy an image somewhere onto that disk, it should tell me if the image already exists on it. That would let me avoid filling up my backup disks with duplicates, although I sometimes do want to add duplicates (e.g. because they're part of a series). Since I like to create my own rendered artwork, I have plenty of images. Plus, I've been taking pictures with digital cameras since 1996, so I also have a huge collection of photos. Add some other large collections to this and you'll soon realise that the amount of data will be huge. (And yes, there are already plenty of duplicates in my collection...)

+1  A: 

Since it's a single-user application that you are considering, I'd probably have a look at SQLite. It ought to fit your other requirements rather nicely, I'd say.
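
To give an idea of how little code the SQLite route would take, here is a rough sketch of the duplicate check using the sqlite3 C API (table and column names are placeholders, and error handling is trimmed):

```cpp
#include <cstdint>
#include <cstdio>
#include <sqlite3.h>

// Print the locations of any already-indexed images with the same
// hash/width/height/size key; returns true if at least one was found.
bool reportDuplicates(sqlite3* db, std::int64_t hash, int width, int height,
                      std::int64_t fileSize) {
    const char* sql =
        "SELECT path FROM images "
        "WHERE hash = ? AND width = ? AND height = ? AND size = ?;";

    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK)
        return false;

    sqlite3_bind_int64(stmt, 1, hash);
    sqlite3_bind_int(stmt, 2, width);
    sqlite3_bind_int(stmt, 3, height);
    sqlite3_bind_int64(stmt, 4, fileSize);

    bool found = false;
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        found = true;
        std::printf("already indexed at: %s\n",
                    reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)));
    }
    sqlite3_finalize(stmt);
    return found;
}
```

The file-name check would be a similar single-parameter SELECT against the unique index on the path.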

Pukku
Good suggestion, but would it perform at least half as well as writing my own file format? It's also a bit annoying that it might not support Delphi or .NET out of the box, although that's probably just a matter of finding the right library. Then again, this project is still in the bright-spark-in-my-head phase, so I can always decide to write it in C++. :-)
Workshop Alex
Did you look at http://www.sqlite.org/cvstrac/wiki?p=SqliteWrappers ? Also, the project's About page mentions, under "Suggested Uses For SQLite": "Application File Format. Rather than using fopen() to write XML or some proprietary format into disk files used by your application, use an SQLite database instead. You'll avoid having to write and troubleshoot a parser, your data will be more easily accessible and cross-platform, and your updates will be transactional." As for speed, I guess you'll have to test it for your specific application, in your specific environment.
Pukku
Yep, seen that too. But basically I don't even need to write my own parser! I only have two fields that are used as indexes, and I don't have to search between two values, or for less than a certain value, or whatever. All I need to be able to do is check whether a file is already in the database (search by filename) and check whether an image is a duplicate (search by key). The Delphi libraries do seem promising, though, and this project would be ideal to develop in Delphi.
Workshop Alex
If you are still considering using C++, you might want to check out STXXL (http://stxxl.sourceforge.net/), too.
Pukku
+2  A: 

I would avoid DIY-ing it unless you know all the repercussions of what you're doing.

Transactional consistency, for example, is not trivial.

I would suggest designing your code in such a way that the backend can be easily replaced later. Run with something sane (SQLite is a good starting choice), develop the application in the most sane and rational way possible, and only then try slotting in the alternative backing store, along the lines of the interface sketched below.
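
A minimal version of that separation, assuming nothing about the eventual backends (all names here are illustrative), could be as small as this:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// The composite key from the question: hash plus width, height and file size.
struct ImageKey {
    std::uint64_t hash;
    std::uint32_t width;
    std::uint32_t height;
    std::uint64_t fileSize;
};

// The only operations the application really needs. A SQLite-backed
// implementation and a hand-rolled file format can both sit behind this,
// so the two can be benchmarked against each other without touching the
// rest of the program.
class ImageIndexBackend {
public:
    virtual ~ImageIndexBackend() = default;
    virtual bool containsPath(const std::string& path) = 0;
    virtual std::vector<std::string> findDuplicates(const ImageKey& key) = 0;
    virtual void add(const ImageKey& key, const std::string& path) = 0;
};
```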

Then profile the differences, and run regression tests against it to make sure your own store is not actually worse than SQLite.

Existing database solutions tend to win because they've had years of improvement and fine-tuning behind them, and a naïve attempt will likely be slower, buggier, and do less, all the while increasing your development load to purely MONUMENTAL proportions.

http://fetter.org/optimization.html

  1. The first rule of Optimization is, you do not talk about Optimization.
  2. The second rule of Optimization is, you DO NOT talk about Optimization.
  3. If your app is running faster than the underlying transport protocol, the optimization is over.
  4. One factor at a time.
  5. No marketroids, no marketroid schedules.
  6. Testing will go on as long as it has to.
  7. If this is your first night at Optimization Club, you have to write a test case.

Also, with databases, there is one thing you utterly MUST get ingrained in your head:

Speed is unimportant

Your data being there when you need it, that is important.

When you can be sure that your data will always be there, then you may worry about trivial concerns like speed.

Hashes

You also mention that you'll be using image hashes (SHA, MD5, etc.) to deduplicate images. This notion has a flaw of its own: hashes of files can only tell you that files are different, not that they're the same.

The logic is akin to asking 30 people to flip a coin, seeing the first one get heads, and then deciding to delete every other person who gets heads, because they're obviously the same person.

http://stackoverflow.com/questions/405628/what-is-the-best-method-to-remove-duplicate-image-files-from-your-computer

Although you may think it unlikely that you'd ever have two different files with the same hash, the odds are a bit like the lottery: the chance of any one ticket winning is low, but somebody wins every day. Don't let it be you.
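
In practice that means treating a hash match as a candidate, not a verdict, and confirming it by comparing the actual bytes. A sketch of that verification step (the function name is illustrative):

```cpp
#include <algorithm>
#include <fstream>
#include <string>

// Confirm a suspected duplicate by comparing file contents byte for byte.
// This only runs for the rare pairs whose keys collide, so the extra I/O
// cost is negligible.
bool filesAreIdentical(const std::string& pathA, const std::string& pathB) {
    std::ifstream fa(pathA, std::ios::binary);
    std::ifstream fb(pathB, std::ios::binary);
    if (!fa || !fb)
        return false;

    char bufA[8192], bufB[8192];
    while (fa && fb) {
        fa.read(bufA, sizeof(bufA));
        fb.read(bufB, sizeof(bufB));
        if (fa.gcount() != fb.gcount())
            return false;                                 // different lengths
        if (!std::equal(bufA, bufA + fa.gcount(), bufB))
            return false;                                 // differing bytes
    }
    return fa.eof() && fb.eof();  // both fully read and identical
}
```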

Kent Fredric
Statistically speaking, once you combine the width, height and size of an image, the chance of finding two different images with the same stats is already small. The hash I use is not SHA/MD5 but a simplified version I've used before, with roughly 1-in-2.5-million odds of a collision. The two combined should reduce the number of collisions to a very small number. And even if a collision does occur, I won't delete duplicates automatically; instead I'll display the image(s) that are already in the collection, and the user can decide whether to add the so-called duplicate or not.
Workshop Alex
Speed will be important in this project, though. Part of what I would like to do is merge photos from two external disks. The application could copy every unique image from A to B, but it must keep a list of duplicates so the user can go through that list and decide whether to keep or discard them. Copying files between USB devices is slow, but that can be done by a background thread. Finding duplicates needs to be fast so the user can quickly decide whether to add them to the background queue; I don't want to waste too much of the user's time.
Workshop Alex
+1  A: 

I just tested the performance of PostgreSQL on my laptop (Core 2 Duo T5800, 2.0 GHz, 3.0 GiB RAM). I have a table with slightly more than 100 million records, 5 columns and some indexes. I performed a range query on one indexed column (not the primary key) and returned all columns. The average query returned 75 rows and executed in 750 ms. You have to decide if this is fast enough.

Daniel Brückner
I used a hash table before on an old Pentium 500 MHz system. It contained just a list of debtor numbers, and the project had to read a text file with 50,000 records and filter out only the lines containing an existing debtor number, producing a list of unknown numbers. Using my own hash implementation, the whole job of parsing the text file, checking for duplicates and writing the reports took less than 30 seconds: well under a millisecond per record! So 750 ms for a single query sounds like a long time to me. :-) Still, it would be a good point if the table were more complex.
Workshop Alex