tags:

views: 60

answers: 1

We have a large number of documents, with metadata (XML files) associated with each document. What is the best way to organize them?

Currently we have created a directory hierarchy:

/repository/category/date(when they were loaded into our db)/document_number.pdf and .xml

We use the path as a unique identifier for the document in our system. A flat structure doesn't seem to be a good option. Also, using the path as an ID helps keep our data independent of our database/application logic, so we can reload the documents easily in case of failure and they will all keep their old IDs. Yet it introduces some limitations: for example, we can't move the files once they've been placed in this structure, and it takes work to arrange them this way. What is the best practice? How do websites such as Scribd deal with this problem?

A: 

Your approach does not seem unreasonable, but it might suffer if you get more than a few thousand documents added within a single day (file systems tend not to cope well with very large numbers of files in a single directory).

Storing the .xml document beside the .pdf seems a bit odd - if it's really metadata about the document, shouldn't it be in the database (which it sounds like you already have), where it can be easily queried and indexed?

When storing very large numbers of files I've usually taken the file's key (say, a URL), hashed it, and then stored it X levels deep in directories based on the first characters of the hash...

Say you started with the key 'http://stackoverflow.com/questions/2734454/how-to-organize-a-large-number-of-objects'. The MD5 hash of that is 0a74d5fb3da8648126ec106623761ac5, so you might store it at...

base_dir/0/a/7/4/http___stackoverflow.com_questions_2734454_how-to-organize-a-large-number-of-objects

...or something like that, which you can easily find again given the key you started with.
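For illustration, here's a minimal sketch of that scheme in Python. The depth of 4 levels and the underscore-based key sanitising are assumptions chosen to match the example path above, not a fixed rule:

```python
import hashlib
import os
import re

def path_for_key(base_dir, key, depth=4):
    """Map a document key (e.g. a URL) to a stable filesystem path.

    Each of the first `depth` hex characters of the key's MD5 digest
    becomes one directory level, spreading files evenly across
    16**depth leaf directories regardless of when they arrive.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    levels = list(digest[:depth])
    # Replace characters that are awkward in filenames with underscores.
    safe_name = re.sub(r"[^A-Za-z0-9._-]", "_", key)
    return os.path.join(base_dir, *levels, safe_name)

key = ("http://stackoverflow.com/questions/2734454/"
       "how-to-organize-a-large-number-of-objects")
print(path_for_key("base_dir", key))
# base_dir/0/a/7/4/http___stackoverflow.com_questions_2734454_how-to-organize-a-large-number-of-objects
```

Because the path is derived purely from the key, you can always recompute where a document lives without consulting the database.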

This kind of approach has one advantage over your date-based one: it scales to very large numbers of documents (even per day) without any single directory becoming too large. On the other hand, it's less intuitive for someone who has to find a particular file manually.

Matt Sheppard
Thanks Matt. The way we currently handle a large number of docs in a single day is to split them into subfolders: 1/ 2/ 3/... which is another reason I think there should be a better way...
shane