I have an application (currently written in Python as we iron out the specifics but eventually it will be written in C) that makes use of individual records stored in plain text files. We can't use a database and new records will need to be manually added regularly.

My question is this: would it be faster to have a single file (500 KB-1 MB) and have my application open it, loop through it to find the record, and close it, OR would it be faster to have the records in separate files, named using some appropriate convention, so that the application could simply loop over filenames to find the data it needs?

I know my question is quite general, so pointers to any good articles on the topic are appreciated as much as suggestions.

Thanks very much in advance for your time, Dan

+1  A: 

Generally it's better to have multiple small files. It keeps memory usage low, and searching through them is much faster.

But it depends on how many operations you'll need, because filesystem calls are much more expensive than memory access, for instance.

rogeriopvl
+1  A: 

The general trade-off is that one big file can be more difficult to update, while lots of little files are fiddly. My suggestion is that if you use multiple files and end up with a lot of them, traversing a directory with a million files in it can get very slow. If possible, break the files into some sort of grouping so they can be put into separate directories and "keyed". I have an application that requires the creation of lots of little PDF documents for all users of the system. If we put these in one directory it would be a nightmare, but having a directory per user ID makes it much more manageable.

Gray Area
+1  A: 

Given your data is 1 MB, I would even consider storing it entirely in memory.

To give you some clue about your question, I'd consider that having one single big file means that your application is doing the management of the lines, while having multiple small files means relying on the operating system and the filesystem to manage the data. The latter can be quite slow, though, because it involves system calls for all your operations.
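
For illustration, a minimal Python sketch of the in-memory approach, assuming one record per line in a hypothetical "records.txt" with tab-separated key and value:

    # Load everything once; at ~1 MB this fits comfortably in memory.
    records = {}
    with open("records.txt") as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition("\t")
            records[key] = value

    # Lookups are now plain dictionary accesses, with no I/O at all.
    print(records.get("FOOBAR"))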

Didier Trosset
+1  A: 

Opening and closing files in C takes a noticeable amount of time. Say you have 500 files of 2 KB each: processing all of them adds 1000 extra operations to your application (500 opens and 500 closes), while having one file of 1 MB saves you those 1000 extra operations. (That is purely my personal opinion.)

mihirpmehta
+3  A: 

Reading a directory is in general more costly than reading a file. But if your naming convention lets you find the file you want without reading the directory (i.e. not "loop over filenames" but "construct a file name"), it may be beneficial to split your database.
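
For example, a sketch in Python, assuming a hypothetical layout with one record per file named after its key:

    import os

    def record_path(key):
        # Build the path directly from the key; no directory listing involved.
        return os.path.join("data", key + ".txt")

    with open(record_path("FOOBAR")) as f:
        record = f.read()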

AProgrammer
Constructing the appropriate filename makes a lot of sense and shouldn't be too tough to make work. Thanks very much.
Dan
+4  A: 

Essentially your second approach is an index - it's just that you're building your index in the filesystem itself. There's nothing inherently wrong with this, and as long as you arrange things so that you don't get too many files in the one directory, it will be plenty fast.

You can achieve the "don't put too many files in the one directory" goal by using multiple levels of directories - for example, the record with key FOOBAR might be stored in data/F/FO/FOOBAR rather than just data/FOOBAR.
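
A rough Python sketch of that layout (the "data" directory and two levels of sharding mirror the example above; keys are assumed to be at least two characters long):

    import os

    def sharded_path(key):
        # data/F/FO/FOOBAR for key "FOOBAR"
        return os.path.join("data", key[0], key[:2], key)

    def write_record(key, text):
        path = sharded_path(key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(text)

    def read_record(key):
        with open(sharded_path(key)) as f:
            return f.read()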

Alternatively, you can make the single large file perform just as well by building an index file that contains a (sorted) list of key-offset pairs. Where the directories-as-index approach falls down is when you want to search on a key different from the one you used to create the filenames - if you've used an index file, you can just create a second index for this situation.
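
A sketch of such an index in Python, again assuming "key<TAB>value" lines in a hypothetical "records.txt" (here the index is held in memory as a sorted list; persisting it to a separate file is a straightforward extension):

    import bisect

    def build_index(path):
        # One pass over the big file, recording each line's byte offset.
        index = []
        with open(path, "rb") as f:
            offset = 0
            for line in f:
                key = line.split(b"\t", 1)[0]
                index.append((key, offset))
                offset += len(line)
        index.sort()
        return index

    def lookup(path, index, key):
        # Binary-search the sorted (key, offset) pairs, then seek directly.
        i = bisect.bisect_left(index, (key, 0))
        if i < len(index) and index[i][0] == key:
            with open(path, "rb") as f:
                f.seek(index[i][1])
                return f.readline()
        return None

    idx = build_index("records.txt")
    print(lookup("records.txt", idx, b"FOOBAR"))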

You may want to reconsider the "we can't use a database" restriction, since you are effectively just building your own database anyway.

caf
Thanks very much for your input. Indexing is certainly something to consider. The database restriction isn't a constraint we've got control over unfortunately...
Dan
Using "partitioned directories" is not something you do for performance reasons, it is purely a way to add scalability when you need to handle lots and lots of files (we're talking hundred thousand files in a single dir here).
Martin Wickman
...and the reason that thousands of files in a single directory is bad is: it's slow.
caf
+1  A: 

This all depends on your filesystem, block size and memory cache, among other things.

As usual, measure and find out if this is a real problem since premature optimization should be avoided. It may be that using one file vs many small files does not matter much for performance in practice and that the choice should be based on clarity and maintainability instead.

(What I can say for certain is that you should not resort to a linear search through the files; use a naming convention to pinpoint the file in O(1) time instead.)
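
A rough way to do that measurement in Python, assuming both hypothetical layouts already exist on disk ("big.txt" with one record per line, and per-record files under "data/"):

    import timeit

    def scan_big_file(key):
        # Linear search through the single large file (open + scan measured together).
        with open("big.txt") as f:
            for line in f:
                if line.startswith(key + "\t"):
                    return line

    def open_small_file(key):
        # Direct open via the naming convention.
        with open("data/%s.txt" % key) as f:
            return f.read()

    print(timeit.timeit(lambda: scan_big_file("FOOBAR"), number=100))
    print(timeit.timeit(lambda: open_small_file("FOOBAR"), number=100))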

Martin Wickman
A: 

I'm curious: why can't you use a DB? I respect your preference, but just want to make sure it's for the right reason.

Not all DBs require a server to connect to or complex deployment. SQLite, for instance, can be easily embedded in your application. Python already has it built-in, and it's very easy to connect with C code (SQLite itself is written in C and its primary API is for C). SQLite manages a feature-complete DB in a single file on the disk, where you can create multiple tables and use all the other nice features of a DB.
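
For instance, a minimal sketch with Python's built-in sqlite3 module (the file name and table layout are just for illustration):

    import sqlite3

    conn = sqlite3.connect("records.db")  # the whole DB is a single file on disk
    conn.execute("CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, value TEXT)")
    conn.execute("INSERT OR REPLACE INTO records VALUES (?, ?)", ("FOOBAR", "some data"))
    conn.commit()

    row = conn.execute("SELECT value FROM records WHERE key = ?", ("FOOBAR",)).fetchone()
    print(row[0] if row else "not found")
    conn.close()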

Eli Bendersky