I couldn't find a good title for this question; here's what I'm trying to do:

  • This is a .NET application.
  • I need to store up to 200,000 objects (3 KB-500 KB each).
  • I need to store about 10 of them per second, from multiple threads.
  • I use binary serialization before storing them.
  • I need to access them later by a unique integer ID.

What's the best way to do this?

  • I can't keep them in memory, as I'll get OutOfMemory exceptions.
  • If I store them on disk as separate files, what are the possible performance issues? Would it decrease overall performance much?
  • Should I implement some sort of batching, e.g. combining 100 objects and writing them as one file, then parsing them later? Or something similar?
  • Should I use a database? (Access time is not important; there won't be any searching, and I'll access objects only a couple of times by their known unique IDs.) In theory I don't need a database, and I don't want to complicate this.

UPDATE:

  • I assume a database would be slower than the file system; prove me wrong if you have evidence to the contrary. That's why I'm also leaning towards the file system. What I'm really worried about is writing 200 KB * 10 per second (about 2 MB/s) to an HDD (this can be any HDD; I don't control the hardware, as it's a desktop tool which will be deployed on different systems).
  • If I use the file system I'll store the files in separate folders to avoid file-system-related issues (so you can ignore that limitation).
+2  A: 

I would be tempted to use a database; in C++ I'd use either SQLite or CouchDB.
These would both work in .NET, but I don't know if there is a better .NET-specific alternative.

Even on filesystems that can handle 200,000 files in a directory, it will take forever to open the directory.

Edit - The DB will probably be faster!
The filesystem isn't designed for huge numbers of small objects; the DB is.
It will implement all sorts of clever caching/transaction strategies that you never thought of.

There are photo sites that chose the filesystem over a DB, but they are mostly doing reads on rather large blobs, and they have lots of admins who are experts in tuning their servers for this specific application.
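For what it's worth, here is a minimal sketch of the blob-in-SQLite approach from C#, assuming the System.Data.SQLite ADO.NET provider (the table layout, connection string, and the id/serializedBytes variables are illustrative, not something prescribed by this answer):

    using System.Data.SQLite;

    // One row per object, keyed by the integer id; the serialized bytes go in a BLOB column.
    using (var conn = new SQLiteConnection("Data Source=objects.db"))
    {
        conn.Open();
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText =
                "CREATE TABLE IF NOT EXISTS objects (id INTEGER PRIMARY KEY, data BLOB)";
            cmd.ExecuteNonQuery();

            cmd.CommandText = "INSERT OR REPLACE INTO objects (id, data) VALUES (@id, @data)";
            cmd.Parameters.AddWithValue("@id", id);
            cmd.Parameters.AddWithValue("@data", serializedBytes); // byte[] from your binary serialization
            cmd.ExecuteNonQuery();
        }
    }

Reads are just a SELECT by primary key. Wrapping batches of inserts in a transaction is what typically buys the big write speedup over one-file-per-object.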

Martin Beckett
Is there any performance advantage to a database? I assume it'll be slower; if there is one, what's the advantage over the file system (assuming I group files 1000 per folder, which easily solves the open-directory problem)?
dr. evil
+4  A: 

If you want to avoid using a database, you can store them as files on disk (to keep things simple). But you need to be aware of filesystem considerations when maintaining a large number of files in a single directory.

A lot of common filesystems maintain their files per directory in some kind of sequential list (e.g., simply storing file pointers or inodes one after the other, or in linked lists). This makes opening files located at the bottom of the list really slow.

A good solution is to limit your directory to a small number of nodes (say n = 1000), and create a tree of files under the directory.

So instead of storing files as:

    /dir/file1
    /dir/file2
    /dir/file3
    ...
    /dir/fileN

Store them as:

    /dir/r1/s2/file1
    /dir/r1/s2/file2
    ...
    /dir/rM/sN/fileP

By splitting up your files this way, you improve access time significantly across most file systems.
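A minimal sketch of such a mapping in C# (the two-level split, the r/s naming taken from the layout above, and the Store helper are all illustrative):

    using System.IO;

    // Map an integer id to /root/rX/sY/id.bin, keeping at most 1000 files per directory.
    static string PathForId(string root, int id)
    {
        int r = id / 1000000;        // top-level bucket (1,000,000 ids each)
        int s = (id / 1000) % 1000;  // second-level bucket (1000 ids each)
        return Path.Combine(root, "r" + r, "s" + s, id + ".bin");
    }

    static void Store(string root, int id, byte[] data)
    {
        string path = PathForId(root, id);
        Directory.CreateDirectory(Path.GetDirectoryName(path)); // creates both levels if missing
        File.WriteAllBytes(path, data);
    }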

(Note that some newer filesystems represent directory entries as trees or other indexed structures. This technique will work well on those too.)

Other considerations are tuning your filesystem (block sizes, partitioning etc.) and your buffer cache such that you get good locality of data. Depending on your OS and filesystem, there are many ways to do this - you'll probably need to look them up.

Alternatively, if this doesn't cut it, you can use some kind of embedded database like SQLite or Firebird.

HTH.

0xfe
I don't control the hardware, so it can be anything from a crappy HDD with FAT32 (not likely, but possible) to a RAID. The OS is always Windows though; this is .NET on Windows, no Mono stuff.
dr. evil
@dr. evil: I think that in the case of a "crappy HDD" incapable of sustaining 2 MB/sec, any solution including a DBMS will fail, simply because any DBMS adds its own overhead while storing data.
Igor Korkhov
+1  A: 

You can check out MongoDB; it supports storing files (via its GridFS mechanism).

Benny
Is there any performance advantage to MongoDB? I assume it'll be slower; if there is one, what's the advantage over the file system (assuming I group files 1000 per folder in the file system)?
dr. evil
A: 

The only way to know for sure would be to know more about your usage scenario.

For instance, will later usage of the files need them in clusters of 100 files at a time? If so, it might make sense to combine them.

In any case, I would try to make a simple solution to begin with, and only change it if you later on find that you have a performance problem.

Here's what I would do:

  1. Make a class that deals with the storage and retrieval, so that you can later change this class rather than every point in your application that uses it (see the sketch after this list)
  2. Store the files on disk as-is; don't combine them
  3. Spread them out over sub-directories, keeping 1000 or fewer files in each directory (directory access adds overhead if you have many files in a single directory)
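A rough sketch of points 1-3 combined (all names here are hypothetical, not part of the answer):

    using System.IO;

    // Single point of storage/retrieval, so the strategy can change later
    // without touching the rest of the application (point 1).
    public class ObjectStore
    {
        private readonly string _root;

        public ObjectStore(string root) { _root = root; }

        public void Save(int id, byte[] serialized)  // files stored as-is (point 2)
        {
            string path = PathFor(id);
            Directory.CreateDirectory(Path.GetDirectoryName(path));
            File.WriteAllBytes(path, serialized);
        }

        public byte[] Load(int id)
        {
            return File.ReadAllBytes(PathFor(id));
        }

        // At most 1000 files per directory (point 3).
        private string PathFor(int id)
        {
            return Path.Combine(_root, (id / 1000).ToString(), id + ".bin");
        }
    }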
Lasse V. Karlsen
The usage scenario is clear, as explained in the question. Later usage is not important at all; I'll only access each object 0-10 times, and I need to access it by ID. It doesn't matter how long that takes, as long as it's under 15-30 seconds.
dr. evil
A: 

I actually don't use .NET so I'm not sure what is easy there, but in general I'd offer two pieces of advice.

If you need to write a lot and read rarely (e.g. log files), you should create a .zip file or the like (choose a compression level that doesn't slow down performance too much; on the usual 1-9 scale, 5 or so usually works for me). This gives you several advantages: you don't hit the filesystem so hard, your storage space is reduced, and you can naturally group files in blocks of 100 or 1000 or whatever.
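In .NET terms, a minimal sketch of that idea, assuming System.IO.Compression's ZipArchive (available from .NET 4.5, in the System.IO.Compression.FileSystem assembly); the 1000-ids-per-archive grouping is illustrative:

    using System.IO;
    using System.IO.Compression;

    // Append one serialized object to the archive covering its id range.
    static void StoreInZip(string dir, int id, byte[] data)
    {
        string zipPath = Path.Combine(dir, "batch" + (id / 1000) + ".zip"); // 1000 objects per archive
        using (var archive = ZipFile.Open(zipPath, ZipArchiveMode.Update))
        using (var stream = archive.CreateEntry(id.ToString(), CompressionLevel.Fastest).Open())
        {
            stream.Write(data, 0, data.Length);
        }
    }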

If you need to write a lot and read a lot, you could define your own flat file format (unless you have access to utilities to read and write .tar files or the like, or can cheat and put binary data in an 8-bit grayscale TIFF). Define fixed-size header records (perhaps 1024 bytes each) that contain the offset into the file, the file name, and anything else you need to store, and then write the data in chunks. When you need to read a chunk, you first read the header (perhaps 100 KB), then jump to the offset you need and read the amount that you need. The advantage of fixed-size headers is that you can write them out empty at the beginning, keep appending new data to the end of the file, and then just go back and overwrite the corresponding record.
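A minimal sketch of that layout in C#: a pre-allocated region of fixed 1024-byte header slots indexed by id, with data chunks appended after it (all sizes and names are illustrative, and error handling is omitted):

    using System;
    using System.IO;

    const int SlotSize = 1024;
    const int MaxRecords = 200000; // header region = MaxRecords * SlotSize bytes

    // Call once on a new file: reserve empty header slots so data lands after them.
    static void Init(FileStream fs)
    {
        fs.SetLength((long)MaxRecords * SlotSize);
    }

    static void Write(FileStream fs, int id, byte[] data)
    {
        fs.Seek(0, SeekOrigin.End);
        long offset = fs.Position; // the data chunk is appended at the end
        fs.Write(data, 0, data.Length);

        // Overwrite this id's fixed-size header slot with (offset, length).
        byte[] slot = new byte[12];
        BitConverter.GetBytes(offset).CopyTo(slot, 0);
        BitConverter.GetBytes(data.Length).CopyTo(slot, 8);
        fs.Seek((long)id * SlotSize, SeekOrigin.Begin);
        fs.Write(slot, 0, slot.Length);
    }

    static byte[] Read(FileStream fs, int id)
    {
        byte[] slot = new byte[12];
        fs.Seek((long)id * SlotSize, SeekOrigin.Begin);
        fs.Read(slot, 0, slot.Length);
        long offset = BitConverter.ToInt64(slot, 0);
        int length = BitConverter.ToInt32(slot, 8);

        byte[] buf = new byte[length];
        fs.Seek(offset, SeekOrigin.Begin);
        fs.Read(buf, 0, length);
        return buf;
    }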

Finally, you could possibly look into something like HDF5; I don't know what the .NET support for that is, but it's a good way to store generic data.

Rex Kerr
A: 

You might consider using Microsoft's Caching Application Block. You can configure it to use IsolatedStorage as a backing store, so items in the cache will be serialized to disk. Performance could be a problem - I think that out of the box it blocks on writes, so you might need to tweak it to do async writes instead.
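For reference, the basic Caching Application Block calls look roughly like this; the backing store (IsolatedStorage here) is chosen in app.config rather than in code, and MyType/myObject are placeholders:

    using Microsoft.Practices.EnterpriseLibrary.Caching;

    // The cache manager and its IsolatedStorage backing store are configured in app.config.
    ICacheManager cache = CacheFactory.GetCacheManager();

    cache.Add(id.ToString(), myObject);                  // keys are strings; values must be serializable
    var restored = (MyType)cache.GetData(id.ToString());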

Jason
A: 

Check out Solid File System, which provides great hierarchical storage (actually a virtual file system) for your data.

Eugene Mayevski (EldoS Corp)