views: 289

answers: 5
In terms of performance and efficiency, is it better to use lots of small files (by "lots" I mean as many as a few million) or a handful (ten or so) of huge (several-gigabyte) files? Let's just say I'm building a database (not entirely true, but all that matters is that it's going to be accessed a LOT).

I'm mainly concerned with read performance. My filesystem is currently ext3 on Linux (Ubuntu Server Edition if it matters), although I'm in a position where I can still switch, so comparisons between different filesystems would be fabulous. For technical reasons I can't use an actual DBMS for this (hence the question), so "just use MySQL" is not a good answer.

Thanks in advance, and let me know if I need to be more specific.


EDIT: I'm going to be storing lots of relatively small pieces of data, which is why using lots of small files would be easier for me. So if I went with using a few large files, I'd only be retrieving a few KB out of them at a time. I'd also be using an index, so that's not really a problem. Also, some of the data points to other pieces of data (it would point to the file in the lots-of-small-files case, and point to the data's location within the file in the large-files case).
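
To make the large-file case concrete, here's a rough sketch in Python (the language and names are just for illustration) of what I have in mind: an in-memory index mapping a key to an (offset, length) pair, so each read is one seek plus a short read.

    # Hypothetical in-memory index: key -> (byte offset, length) inside the big file.
    # It would be built/loaded when the data is written.
    index = {
        "record-42": (1024, 2048),   # e.g. a 2 KB record starting at byte 1024
    }

    def read_record(path, key):
        """Fetch one small record out of a multi-gigabyte file."""
        offset, length = index[key]
        with open(path, "rb") as f:
            f.seek(offset)            # jump straight to the record
            return f.read(length)     # read only the few KB needed

    # data = read_record("/var/data/records.bin", "record-42")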

+3  A: 

The main issue here, IMO, is indexing. If you're going to search for information in a huge file without a good index, you'll have to scan the whole file for the correct information, which can take a long time. If you think you can build strong indexing mechanisms, then fine, go with the huge file.

I'd prefer to delegate this task to ext3, which should be rather good at it.
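
For example, a minimal Python sketch (file names and record format are assumptions, not a recommendation) of building such an index while appending records to one big file; the resulting mapping is what you would later load into memory:

    import json

    def append_records(data_path, index_path, records):
        """Append (key, bytes) records to one big file; persist key -> (offset, length)."""
        index = {}
        with open(data_path, "ab") as data:
            for key, payload in records:
                offset = data.tell()           # current end of file = offset of this record
                data.write(payload)
                index[key] = (offset, len(payload))
        with open(index_path, "w") as idx:
            json.dump(index, idx)              # naive persistent index; a real one could be smarter
        return index

    # append_records("records.bin", "records.idx", [("foo", b"hello"), ("bar", b"world")])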

Edit:

One thing to consider, according to this Wikipedia article on ext3, is that fragmentation does happen over time. So if you have a huge number of small files that take up a significant percentage of the file system, you will lose performance over time.

The article also validates the claim about the 32k-files-per-directory limit (assuming a Wikipedia article can validate anything).

Eric
Well I'd have an index (probably in memory) if I went with the huge files. It's not like I'm going to be searching an entire 8GB file every time I need 2KB of data.
musicfreak
+1  A: 

I believe ext3 has a limit of about 32,000 files/subdirectories per directory, so if you're going the millions-of-files route, you'll need to spread them across many directories. I don't know what that would do to performance.

My preference would be for the several large files. In fact, why have several at all, unless they're some kind of logically separate units? If you're still splitting it up just for the sake of splitting it, I say don't do that. Ext3 can handle very large files just fine.

rmeador
Ah man, it does? Didn't know about that... +1
musicfreak
Also, yes I'd split the large files because they contain completely different types of data. But all the data of the same type would be in the same file.
musicfreak
+5  A: 

There are a lot of assumptions here but, for all intents and purposes, searching through a large file will be much quicker than searching through a bunch of small files.

Let's say you are looking for a string of text contained in a text file. Searching one 1 TB file will be much faster than opening 1,000,000 one-megabyte files and searching through each of them.

Each file-open operation takes time. A large file only has to be opened once.

And, in considering disk performance, a single file is much more likely to be stored contiguously than a large series of files.
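
If you want to see the per-open cost for yourself, here is a rough Python sketch (counts and sizes are made up, and OS caching will skew the numbers, so treat it as illustrative only):

    import os, tempfile, time

    N, SIZE = 1000, 2048                      # 1,000 records of 2 KB each
    tmp = tempfile.mkdtemp()

    # Setup: the same records stored as N small files and as one big file.
    for i in range(N):
        with open(os.path.join(tmp, f"small_{i}"), "wb") as f:
            f.write(b"x" * SIZE)
    with open(os.path.join(tmp, "big"), "wb") as f:
        f.write(b"x" * SIZE * N)

    t0 = time.perf_counter()
    for i in range(N):                        # one open() per record
        with open(os.path.join(tmp, f"small_{i}"), "rb") as f:
            f.read(SIZE)
    t1 = time.perf_counter()
    with open(os.path.join(tmp, "big"), "rb") as f:
        for i in range(N):                    # one open(), many seeks
            f.seek(i * SIZE)
            f.read(SIZE)
    t2 = time.perf_counter()

    print(f"many small files: {t1 - t0:.4f}s   one big file: {t2 - t1:.4f}s")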

...Again, these are generalizations without knowing more about your specific application.

Enjoy,

Robert C. Cartaino

Robert Cartaino
True, unless you can choose which small file to search through. Somehow.
DOK
+3  A: 

It depends, really. Different filesystems are optimized in different ways, but in general, small files are packed efficiently. The advantage of having large files is that you don't have to open and close a lot of stuff; open and close are operations that take time. If you have a large file, you normally open and close it only once and then use seek operations.

If you go for the lots-of-files solution, I suggest a structure like

b/a/bar
b/a/baz
f/o/foo

because there are limits on the number of files in a directory.
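
Something like this Python sketch (names assumed) would build that layout, deriving the two directory levels from the first characters of the filename:

    import os

    def sharded_path(root, name):
        """Map e.g. 'bar' -> root/b/a/bar so no single directory gets too large."""
        d = os.path.join(root, name[0], name[1])
        os.makedirs(d, exist_ok=True)
        return os.path.join(d, name)

    # sharded_path("/var/data", "bar")  -> /var/data/b/a/bar
    # sharded_path("/var/data", "foo")  -> /var/data/f/o/foo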

Stefano Borini
+1  A: 

I work with a system that stores up to about 5 million files on an XFS file system under Linux and haven't had any performance problems. We only use the files for storing the data; we never scan them in full. We have a database for searching, and one of the fields in a table contains a GUID which we use to retrieve the file. We use exactly two levels of directories, as above, with the filenames being the GUID, though more levels could be used if the number of files got even larger. We chose this approach to avoid storing a few extra terabytes in the database that only needed to be stored/returned and never searched through, and it has worked well for us. Our files range from 1 KB to about 500 KB.

We have also run the system on ext3, and it functioned fine, though I'm not sure we ever pushed it past about a million files. We'd probably need to go to a three-level directory scheme there due to the maximum-files-per-directory limitations.
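
As a rough Python sketch of that retrieval path (how the GUID is split across the two directory levels is an assumption for illustration; the GUID itself would come from the database query):

    import os

    def guid_path(root, guid):
        """Two directory levels taken from the GUID; the filename is the GUID itself."""
        return os.path.join(root, guid[0:2], guid[2:4], guid)

    def fetch_blob(root, guid):
        with open(guid_path(root, guid), "rb") as f:
            return f.read()

    # fetch_blob("/srv/blobs", "3f2504e04f8911d39a0c0305e82c3301")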

bdk