tags:
views: 1114
answers: 6

I have an idea for a simple application that will monitor a group of folders and index any files it finds. A GUI will let me quickly tag new files and move them into a single database for storage, and it will also provide an easy mechanism for querying the database by tag, name, file type and date. At the moment I have about 100+ GB of files on a couple of removable hard drives, so the database will be at least that big. If possible I would like to support full-text search of the embedded binary and text documents. This will be a single-user application.
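
To make the scanning part concrete, here is roughly what I have in mind for the folder pass, just a sketch using only the standard library (the folder paths are placeholders):

    import os
    import datetime

    WATCHED_FOLDERS = ["/mnt/drive1", "/mnt/drive2"]   # placeholder paths

    def scan(folders):
        """Walk each folder and yield the metadata I want to tag and store."""
        for root in folders:
            for dirpath, dirnames, filenames in os.walk(root):
                for filename in filenames:
                    full_path = os.path.join(dirpath, filename)
                    stat = os.stat(full_path)
                    yield {
                        "path": full_path,
                        "name": filename,
                        "file_type": os.path.splitext(filename)[1].lstrip("."),
                        "size": stat.st_size,
                        "modified": datetime.datetime.fromtimestamp(stat.st_mtime),
                    }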

Not trying to start a DB war, but which open source DB is going to work best for me? I am pretty sure SQLite is off the table, but I could be wrong.

A: 

I always hate to answer "don't", but you'd be better off indexing with something like Lucene (PyLucene). Storing the paths in the database rather than the file contents is almost always recommended as well.

To add to that, none of those database engines will store LOBs in a separate dataspace (they'll be embedded in the table's data space), so any of those engines should perform nearly equally well (well, except SQLite). You need to move to Informix, DB2, SQL Server or others to get that kind of binary object handling.
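
As an illustration of the indexing side, here is a minimal sketch of the pattern (index the extracted text, store only the path). It uses Whoosh, a pure-Python Lucene-style library, purely as a stand-in for PyLucene, which needs a JVM; the schema and field names are made up for the example:

    import os
    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.qparser import QueryParser

    # The index stores only the file path plus the extracted text; files stay on disk.
    schema = Schema(path=ID(stored=True, unique=True), content=TEXT)
    os.makedirs("indexdir", exist_ok=True)
    ix = create_in("indexdir", schema)

    writer = ix.writer()
    writer.add_document(path="/archive/report.txt",
                        content="text previously extracted from the file")
    writer.commit()

    # A full-text query returns paths, which you then resolve against the filesystem.
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse("budget")
        for hit in searcher.search(query):
            print(hit["path"])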

hometoast
+2  A: 

Why store the files in the database at all? Simply store your meta-data and a filename. If you need to copy them to a new location for some reason, just do that as a file system copy.

Once you remove the file contents, any competent database will be able to handle the meta-data for a few hundred thousand files.
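
A minimal sketch of what such a metadata-only schema might look like, using Python's built-in sqlite3 purely for illustration (the table and column names are made up):

    import sqlite3

    conn = sqlite3.connect("catalog.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS files (
            id        INTEGER PRIMARY KEY,
            path      TEXT UNIQUE NOT NULL,   -- where the file actually lives
            name      TEXT NOT NULL,
            file_type TEXT,
            added_at  TEXT                    -- ISO-8601 timestamp
        );
        CREATE TABLE IF NOT EXISTS tags (
            file_id INTEGER REFERENCES files(id),
            tag     TEXT NOT NULL
        );
    """)

    # Query by tag: the database returns paths, not file contents.
    rows = conn.execute("""
        SELECT f.path FROM files f
        JOIN tags t ON t.file_id = f.id
        WHERE t.tag = ?
    """, ("invoices",)).fetchall()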

Rob Walker
A: 

Pretty much any of them would work (even though SQLite wasn't meant to be used in a concurrent multi-user environment, which could be a problem...) since you don't want to index the actual contents of the files.

The only limiting factor is the maximum "packet" size of the given DB (by packet I'm referring to a query/response). Usually this limit is around 2 MB, meaning that your files must be smaller than 2 MB. Of course you could increase this limit, but the whole process is rather inefficient, since, for example, to insert a file you would have to:

  • Read the entire file into memory
  • Transform the file into a query (which usually means hex-encoding it, thus doubling its size from the start)
  • Execute the generated query (which means that the database has to parse it)

I would go with a simple DB and store the associated files using a naming convention which makes them easy to find (for example, based on the primary key). Of course this design is not "pure", but it will perform much better and is also easier to use.
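
A sketch of that naming convention, again using sqlite3 and the standard library only for illustration (the directory and column names are made up):

    import shutil
    import sqlite3
    from pathlib import Path

    STORE = Path("file_store")
    STORE.mkdir(exist_ok=True)

    conn = sqlite3.connect("catalog.db")
    conn.execute("CREATE TABLE IF NOT EXISTS files (id INTEGER PRIMARY KEY, name TEXT)")

    def add_file(source_path):
        """Insert a metadata row, then copy the file to a location derived from its primary key."""
        src = Path(source_path)
        cur = conn.execute("INSERT INTO files (name) VALUES (?)", (src.name,))
        conn.commit()
        dest = STORE / ("%d%s" % (cur.lastrowid, src.suffix))  # e.g. file_store/42.pdf
        shutil.copy2(src, dest)
        return cur.lastrowid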

Cd-MaN
+2  A: 

I'm still researching this option for one of my own projects, but CouchDB may be worth a look.
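
For what it's worth, CouchDB's HTTP API makes the document-plus-attachment model easy to try. A rough sketch using the requests library (the server URL, database and document names are assumptions, and a real setup may also need credentials):

    import json
    import requests

    BASE = "http://localhost:5984"   # assumed local CouchDB instance

    # Create a database, then a document holding the metadata.
    requests.put(BASE + "/files")
    resp = requests.put(BASE + "/files/report-2008-01",
                        data=json.dumps({"name": "report.pdf", "tags": ["finance"]}))
    rev = resp.json()["rev"]

    # Attach the binary file itself to the document.
    with open("report.pdf", "rb") as fh:
        requests.put(BASE + "/files/report-2008-01/report.pdf",
                     params={"rev": rev},
                     data=fh,
                     headers={"Content-Type": "application/pdf"})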

Ryan Duffield
A: 

Why are you wasting time emulating something that the filesystem should be able to handle? More storage plus grep is your answer.

Uh, the obvious: 100 GB of files at under 2 MB each is at least 50,000 files. Searching a file system directory of that many files would be far slower than searching a metadata DB table.
djangofan
Here is a tool for this: http://www.voidtools.com/
drorhan
+1  A: 

My preference would be to store the document with the metadata. One reason is relational integrity: you can't easily move or modify the files without the action being brokered by the DB. I am sure I can handle these problems, but it isn't as clean as I would like, and my experience has been that most vendors can handle huge amounts of binary data in the database these days. I guess I was wondering whether PostgreSQL or MySQL have any obvious advantages in these areas; I am primarily familiar with Oracle. Anyway, thanks for the responses. If the DB knows where the external file is, it will also be easy to bring the file in at a later date if I want. Another aspect of the question was whether either database is easier to work with from Python. I'm assuming that is a wash.
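
On the Python side it largely is a wash: both psycopg2 (PostgreSQL) and MySQLdb (MySQL) implement the DB-API 2.0 interface (PEP 249), so code written against one looks almost identical against the other. A rough sketch, with connection details and table names made up:

    import psycopg2          # for MySQL you would swap in: import MySQLdb

    # Only the connect() call really differs between the two drivers.
    conn = psycopg2.connect(dbname="catalog", user="app", password="secret", host="localhost")
    cur = conn.cursor()

    # Both drivers use %s placeholders and the same cursor/execute/fetch pattern.
    cur.execute("SELECT path, name FROM files WHERE file_type = %s", ("pdf",))
    for path, name in cur.fetchall():
        print(path, name)

    cur.close()
    conn.close()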

Ethan Post