views: 289

answers: 5
In terms of performance and efficiency, is it better to use lots of small files (by "lots" I mean as many as a few million) or a handful (ten or so) of huge (several-gigabyte) files? Let's just say I'm building a database (not entirely true, but all that matters is that it's going to be accessed a LOT).

I'm mainly concerned with read performance. My filesystem is currently ext3 on Linux (Ubuntu Server Edition if it matters), although I'm in a position where I can still switch, so comparisons between different filesystems would be fabulous. For technical reasons I can't use an actual DBMS for this (hence the question), so "just use MySQL" is not a good answer.

Thanks in advance, and let me know if I need to be more specific.


EDIT: I'm going to be storing lots of relatively small pieces of data, which is why using lots of small files would be easier for me. So if I went with using a few large files, I'd only be retrieving a few KB out of them at a time. I'd also be using an index, so that's not really a problem. Also, some of the data points to other pieces of data (it would point to the file in the lots-of-small-files case, and point to the data's location within the file in the large-files case).
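
To make the large-file case concrete, here's a rough sketch in Python (the language and names are just for illustration) of what I have in mind: an in-memory index mapping a key to an (offset, length) pair, so each read is one seek plus a short read.

    # Hypothetical in-memory index: key -> (byte offset, length) inside the big file.
    # It would be built/loaded when the data is written.
    index = {
        "record-42": (1024, 2048),   # e.g. a 2 KB record starting at byte 1024
    }

    def read_record(path, key):
        """Fetch one small record out of a multi-gigabyte file."""
        offset, length = index[key]
        with open(path, "rb") as f:
            f.seek(offset)            # jump straight to the record
            return f.read(length)     # read only the few KB needed

    # data = read_record("/var/data/records.bin", "record-42")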

+3  A: 

The main issue here, IMO, is indexing. If you're going to search for information in a huge file without a good index, you'll have to scan the whole file for the correct information, which can take a long time. If you think you can build strong indexing mechanisms, then fine, go with the huge file.

I'd prefer to delegate this task to ext3, which should be rather good at it.
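
For example, a minimal Python sketch (file names and record format are assumptions, not a recommendation) of building such an index while appending records to one big file; the resulting mapping is what you would later load into memory:

    import json

    def append_records(data_path, index_path, records):
        """Append (key, bytes) records to one big file; persist key -> (offset, length)."""
        index = {}
        with open(data_path, "ab") as data:
            for key, payload in records:
                offset = data.tell()           # current end of file = offset of this record
                data.write(payload)
                index[key] = (offset, len(payload))
        with open(index_path, "w") as idx:
            json.dump(index, idx)              # naive persistent index; a real one could be smarter
        return index

    # append_records("records.bin", "records.idx", [("foo", b"hello"), ("bar", b"world")])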

Edit:

One thing to consider, according to this Wikipedia article on ext3, is that fragmentation does happen over time. So if you have a huge number of small files that take up a significant percentage of the file system, you will lose performance over time.

The article also validates the claim about the 32k-files-per-directory limit (assuming a Wikipedia article can validate anything).

Eric
Well I'd have an index (probably in memory) if I went with the huge files. It's not like I'm going to be searching an entire 8GB file every time I need 2KB of data.
musicfreak
+1  A: 

I believe ext3 has a limit of about 32,000 files/subdirectories per directory, so if you're going the millions-of-files route, you'll need to spread them across many directories. I don't know what that would do to performance.

My preference would be for the several large files. In fact, why have several at all, unless they're some kind of logically separate units? If you're still splitting it up just for the sake of splitting it, I say don't do that. Ext3 can handle very large files just fine.

rmeador
Ah man, it does? Didn't know about that... +1
musicfreak
Also, yes I'd split the large files because they contain completely different types of data. But all the data of the same type would be in the same file.
musicfreak
+5  A: 

There are a lot of assumptions here but, for all intents and purposes, searching through a large file will be much quicker than searching through a bunch of small files.

Let's say you are looking for a string of text contained in a text file. Searching one 1 TB file will be much faster than opening 1,000,000 one-megabyte files and searching through each of them.

Each file-open operation takes time. A large file only has to be opened once.

And, in considering disk performance, a single file is much more likely to be stored contiguously than a large series of files.
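
If you want to see the per-open cost for yourself, here is a rough Python sketch (counts and sizes are made up, and OS caching will skew the numbers, so treat it as illustrative only):

    import os, tempfile, time

    N, SIZE = 1000, 2048                      # 1,000 records of 2 KB each
    tmp = tempfile.mkdtemp()

    # Setup: the same records stored as N small files and as one big file.
    for i in range(N):
        with open(os.path.join(tmp, f"small_{i}"), "wb") as f:
            f.write(b"x" * SIZE)
    with open(os.path.join(tmp, "big"), "wb") as f:
        f.write(b"x" * SIZE * N)

    t0 = time.perf_counter()
    for i in range(N):                        # one open() per record
        with open(os.path.join(tmp, f"small_{i}"), "rb") as f:
            f.read(SIZE)
    t1 = time.perf_counter()
    with open(os.path.join(tmp, "big"), "rb") as f:
        for i in range(N):                    # one open(), many seeks
            f.seek(i * SIZE)
            f.read(SIZE)
    t2 = time.perf_counter()

    print(f"many small files: {t1 - t0:.4f}s   one big file: {t2 - t1:.4f}s")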

...Again, these are generalizations without knowing more about your specific application.

Enjoy,

Robert C. Cartaino

Robert Cartaino
True, unless you can choose which small file to search through. Somehow.
DOK
+3  A: 

It depends, really. Different filesystems are optimized in different ways, but in general, small files are packed efficiently. The advantage of having large files is that you don't have to open and close a lot of stuff; open and close are operations that take time. If you have a large file, you normally open and close it only once and then use seek operations.

If you go for the lots-of-files solution, I suggest a structure like

b/a/bar
b/a/baz
f/o/foo

because there are limits on the number of files in a directory.
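
Something like this Python sketch (names assumed) would build that layout, deriving the two directory levels from the first characters of the filename:

    import os

    def sharded_path(root, name):
        """Map e.g. 'bar' -> root/b/a/bar so no single directory gets too large."""
        d = os.path.join(root, name[0], name[1])
        os.makedirs(d, exist_ok=True)
        return os.path.join(d, name)

    # sharded_path("/var/data", "bar")  -> /var/data/b/a/bar
    # sharded_path("/var/data", "foo")  -> /var/data/f/o/foo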

Stefano Borini
+1  A: 

I work with a system that stores up to about 5 million files on an XFS file system under Linux and haven't had any performance problems. We only use the files for storing the data; we never scan them in full. We have a database for searching, and one of the fields in a table contains a GUID which we use to retrieve the file. We use exactly two levels of directories, as above, with the filenames being the GUID, though more levels could be used if the number of files got even larger. We chose this approach to avoid storing a few extra terabytes in the database that only needed to be stored/returned and never searched through, and it has worked well for us. Our files range from 1 KB to about 500 KB.

We have also run the system on ext3, and it functioned fine, though I'm not sure we ever pushed it past about a million files. We'd probably need to go to a three-level directory scheme there due to the maximum-files-per-directory limitations.
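
As a rough Python sketch of that retrieval path (how the GUID is split across the two directory levels is an assumption for illustration; the GUID itself would come from the database query):

    import os

    def guid_path(root, guid):
        """Two directory levels taken from the GUID; the filename is the GUID itself."""
        return os.path.join(root, guid[0:2], guid[2:4], guid)

    def fetch_blob(root, guid):
        with open(guid_path(root, guid), "rb") as f:
            return f.read()

    # fetch_blob("/srv/blobs", "3f2504e04f8911d39a0c0305e82c3301")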

bdk