views: 74

answers: 5

If I have 2 files each with this:

"Hello World" (x 1000)

Does that take up more space than 1 file with this:

"Hello World" (x 2000)

What are the drawbacks of dividing content into multiple smaller files (assuming there's reason to divide them into more files, not like this example)?

Update:

I'm using a MacBook Pro running Mac OS X 10.5, but I'd also like to know for Ubuntu Linux.

+1  A: 

Most filesystems use a fixed-size cluster (4 kB is typical but not universal) for storing files. Files below this cluster size will all take up the same minimum amount.

Even above this size, the proportional wastage tends to be high when you have lots of small files. Ignoring skewness of size distribution (which makes things worse), the overall wastage is about half the cluster size times the number of files, so the fewer files you have for a given amount of data, the more efficiently you will store things.
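
To put a rough number on that, here is a small Python sketch that estimates the slack space under a directory tree. The 4 kB cluster size is an assumption, so check what your filesystem actually uses; the sketch also ignores per-file metadata entirely.

    import os

    CLUSTER_SIZE = 4096  # assumed cluster size; your filesystem may differ

    def slack_bytes(path):
        """Gap between a file's logical size and the whole clusters it occupies."""
        size = os.path.getsize(path)
        clusters = -(-size // CLUSTER_SIZE)  # ceiling division
        return clusters * CLUSTER_SIZE - size

    total = files_seen = 0
    for root, _dirs, files in os.walk("."):
        for name in files:
            total += slack_bytes(os.path.join(root, name))
            files_seen += 1

    print(files_seen, "files, roughly", total, "bytes lost to cluster rounding")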

Another consideration is that metadata operations, especially file deletion, can be very expensive, so again smaller files aren't your friends. Some interesting work was done in ReiserFS on this front until the author was jailed for murdering his wife (I don't know the current state of that project).

If you have the option, you can also tune the file sizes to always fill up a whole number of clusters, and then small files won't be a problem. This is usually too finicky to be worth it though, and there are other costs. For high-volume throughput, the optimal file size these days is between 64 MB and 256 MB (I think).

Practical advice: Stick your stuff in a database unless there are good reasons not to. SQLite substantially reduces the number of reasons.
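
For example, a minimal sketch of that approach using Python's built-in sqlite3 module (the file and table names here are just placeholders):

    import sqlite3

    # One database file holds many small documents instead of many tiny files.
    conn = sqlite3.connect("snippets.db")
    conn.execute("CREATE TABLE IF NOT EXISTS docs (name TEXT PRIMARY KEY, body TEXT)")

    with conn:  # commits the transaction on success
        conn.executemany(
            "INSERT OR REPLACE INTO docs (name, body) VALUES (?, ?)",
            [("a.txt", "Hello World\n" * 1000), ("b.txt", "Hello World\n" * 1000)],
        )

    body = conn.execute("SELECT body FROM docs WHERE name = ?", ("a.txt",)).fetchone()[0]
    print(len(body), "bytes read back")
    conn.close()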

Marcelo Cantos
Agree with one exception: I think the cluster size acts as an alignment for file sizes, i.e. a 1 kB file will take 4 kB and a 5 kB file will take 8 kB. Each file is also linked with additional information such as its name, size, probably a cluster map, attributes, and access rights, and the containing folder has to hold an entry pointing to that file. More information means more space on disk.
ony
+1  A: 

Files take up space in the form of clusters on the disk. A cluster is a number of sectors, and the size depends on how the disk was formatted.

A typical size for clusters is 8 kilobytes. That would mean that the two smaller files would use two clusters (16 kilobytes) each and the larger file would use three clusters (24 kilobytes).

On average, a file will use half a cluster more than its size. So with a cluster size of 8 kilobytes, each file will on average have an overhead of 4 kilobytes.
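
You can check this for yourself on Mac OS X or Linux: os.stat reports both the logical size and the allocated blocks (st_blocks is counted in 512-byte units), so a quick Python sketch along these lines shows the overhead directly.

    import os

    line = "Hello World\n"
    with open("half_a.txt", "w") as f:
        f.write(line * 1000)
    with open("half_b.txt", "w") as f:
        f.write(line * 1000)
    with open("whole.txt", "w") as f:
        f.write(line * 2000)

    for name in ("half_a.txt", "half_b.txt", "whole.txt"):
        st = os.stat(name)
        allocated = st.st_blocks * 512  # st_blocks counts 512-byte units on Unix
        print(name, "logical:", st.st_size, "allocated:", allocated)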

Guffa
+3  A: 

Marcelo gives the general performance case. I'd argue that worrying about this is premature optimization; you should split things into different files where it is logical to split them.

Also, if you really care about the file size of such repetitive files, you can compress them. Your example even hints at this: a simple run-length encoding of

"Hello World"x1000

is much more space efficient than actually having "Hello World" written out 1000 times.
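
As a rough illustration of how well that kind of repetition compresses, a few lines of Python using the standard zlib module:

    import zlib

    data = ("Hello World\n" * 1000).encode()
    compressed = zlib.compress(data)
    # Highly repetitive input collapses to a tiny fraction of its raw size.
    print(len(data), "bytes raw ->", len(compressed), "bytes compressed")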

jk
I'd agree, to a point. I've seen some truly ghastly projects that went for ultra-fine-grained files and bore a hideous cost in performance and maintenance. Ignoring performance at this level can easily cost you three orders of magnitude and a ton of rework.
Marcelo Cantos
File size isn't really that important in my case. I'm using [jekyll](http://wiki.github.com/mojombo/jekyll/sites) to generate a static site, and I was wondering: if I have objects like a "testimonial" that is only a sentence plus some keywords and an image tag (all of which I could write in Textile), should I combine them into one XML file, or have one small (say, 10-line) file for each testimonial? What do you think, from a practical standpoint? These tiny things always get me :)
viatropos
I agree with Marcelo: there's no such thing as premature optimization when your basic unit of work is the physical movement of HD parts rather than CPU instructions - the difference is more like 5 orders of magnitude.
Michael Borgwardt
@marcelo and @michael, what is an example of the edge cases you're describing? I'm interested; I never imagined something like that.
viatropos
@viatropos: Does this "site generating" happen once on your machine, with the combined static pages then uploaded to the web server? If so, performance is obviously irrelevant. But if it happens for every page view, then reducing the number of files is your absolute top priority.
Michael Borgwardt
The combined static pages are uploaded to the server, so performance is irrelevant. But it's good to know when performance is an issue! I'm trying to decide whether to have one big XML file with CDATA blocks, or multiple small Textile files. Textile is easier to read; XML is smaller but you can view it all at once. Difficult to decide...
viatropos
@viatropos: have you ever had your computer freeze and become unresponsive for half a minute while the hard disk is working furiously and you can see window contents being updated pixel line by pixel line? That's what happens when the machine is preoccupied with random HD access. Now imagine a system that works like that all the time...
Michael Borgwardt
I have seen that; that's what it is! Usually when I start and stop a local server multiple times per minute trying to resolve quick little things like formatting on a web page. Probably because it has to reread the hundreds of files into memory multiple times.
viatropos
Whether it makes a difference is going to depend on the situation. If you have 5 of these testimonials, not so much. If you have 5 million, then yes, combine them (or shove them in a DB). Hence it is premature optimization if you worry about this in the 5-file case.
jk
A: 

I think how the file(s) will be used should also be taken into consideration, along with the API and the language used to read/write them (and hence any API restrictions). Disk fragmentation, which tends to decrease when there are only big files, will penalize data access if you're reading one big file in one shot, whereas several accesses to small files, spaced out over time, will not be penalized by fragmentation.

snowflake
A: 

Most filesystems allocate space in units larger than a byte (typically 4KB nowadays). Effective file sizes get "rounded up" to the next multiple of that "cluster size". Therefore, dividing up a file will almost always consume more total space. And of course there's one extra entry in the directory, which may cause it to consume more space, and many file systems have an extra intermediate layer of inodes where each file consumes one entry.

What are the drawbacks of dividing content into multiple smaller files (assuming there's reason to divide them into more files, not like this example)?

  • More wasted space
  • The possibility of running out of inodes (in extreme cases; see the sketch after this list for a way to check)
  • On some filesystems: very bad performance when directories contain many files (because they're effectively unordered lists)
  • Content in a single file can usually be read sequentially (i.e. without having to move the read/write head) from the HD, which is the most efficient way. When it spans multiple files, this ideal case becomes much less likely.
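
If you want to see how close you are to the inode limit, os.statvfs reports filesystem-wide totals on Linux and Mac OS X; a minimal Python sketch:

    import os

    st = os.statvfs("/")  # any path on the filesystem you care about
    print("total inodes:", st.f_files)
    print("free inodes: ", st.f_ffree)
    print("free bytes:  ", st.f_bavail * st.f_frsize)  # space available to unprivileged users
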
Michael Borgwardt