views: 235

answers: 9
I have this problem: I have a collection of small files, each about 2,000 bytes (they are all roughly the same size), and there are about ~100,000 of them, which comes to about 200 megabytes of space. I need to be able to, in real time, select a range of these files, say file 1000 to 1100 (100 files total), read them, and send them over the network decently fast.

The good thing is the files will always be read in sequence, i.e. it's always going to be a range of say "from this file and a hundred more" and not "this file here, and that file over there, etc.".

Files can also be added to this collection during runtime, so it's not a fixed number of files.

The current scheme I've come up with is this: no file is larger than 2,000 bytes, so instead of having several files allocated on the disk I'm going to have one large file containing all the others at even 2048-byte intervals. The first 2 bytes of each 2048-byte block hold the actual byte size of the file contained in the next 2046 bytes (the files range between roughly 1,800 and 1,950 bytes in size), and I then seek inside this file instead of opening a new file handle for each file I need to read.

So when I need to get the file at position X, I will just seek to X*2048, read the first two bytes, and then read from (X*2048)+2 for the number of bytes given in those first two bytes. This large 200 MB file will be append-only, so it's safe to read even while the serialized input thread/process (haven't decided yet) appends more data to it.
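
For reference, here's roughly what the read would look like in C# (an untested sketch; the class and helper names are just for illustration):

    using System.IO;

    static class BigFileReader
    {
        const int BlockSize = 2048;

        // Reads the small file stored at logical index x out of the big append-only file.
        // The FileStream should be opened with FileShare.ReadWrite so the single
        // appending writer and the readers can share it.
        public static byte[] ReadEntry(FileStream bigFile, long x)
        {
            bigFile.Seek(x * BlockSize, SeekOrigin.Begin);

            // First two bytes of the block hold the payload size (stored little-endian here).
            byte[] header = ReadExactly(bigFile, 2);
            int length = header[0] | (header[1] << 8);

            // Then read just the real payload out of the remaining 2046 bytes.
            return ReadExactly(bigFile, length);
        }

        static byte[] ReadExactly(Stream s, int count)
        {
            var buffer = new byte[count];
            int read = 0;
            while (read < count)
            {
                int n = s.Read(buffer, read, count - read);
                if (n == 0) throw new EndOfStreamException();
                read += n;
            }
            return buffer;
        }
    }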

This has to be doable on Windows. C is an option, but I would prefer C#.

+3  A: 

Do you have anything against storing these files in a database?

A simple RDBMS would drastically speed up the searching and sorting of a bunch of 2K files.
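
As a rough, untested sketch of what the range read could look like from C# against SQL Server (the Blobs table and its columns are made up for illustration):

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static class BlobTable
    {
        // Fetch rows 'from' through 'to' (inclusive) from a hypothetical table:
        //   CREATE TABLE Blobs (Id INT PRIMARY KEY, Data VARBINARY(2048))
        public static List<byte[]> ReadRange(string connectionString, int from, int to)
        {
            var result = new List<byte[]>();
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                "SELECT Data FROM Blobs WHERE Id BETWEEN @from AND @to ORDER BY Id", conn))
            {
                cmd.Parameters.AddWithValue("@from", from);
                cmd.Parameters.AddWithValue("@to", to);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        result.Add((byte[])reader["Data"]);
                }
            }
            return result;
        }
    }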

Neil N
There is no searching or sorting; they are indexed in the order they are appended and will never be reordered or selected in any other way than the order they are put in.
thr
That is still "searching"
Neil N
Not to mention RETURNING them (which is what you are looking for) would still be faster than a file-based method.
Neil N
Yes, OK, technically, but I don't see how a database will be faster than doing 10*2048 to get to the correct position.
thr
Because unless you're going to keep a bunch of file handles open persistently, the DB was designed to do exactly this, and to do it better than straight disk I/O.
Neil N
There isn't really a problem with keeping the file handles persistent; I can have multiple readers and one writer, since the file is append-only, and they can hold the handles for as long as they want.
thr
But I see your point about the DB; I will see what I can cook up in SQL Server.
thr
+2  A: 

I think your idea is probably the best you can do with a reasonable amount of work.

Alternatively, you could buy a solid-state disk and not care about the file size.

Or you could just preload all of the data into an in-memory collection if you don't depend on keeping RAM usage low (this will also be the fastest option; a rough sketch follows below).

Or you could use a database, but the overhead here will be substantial.
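
A minimal, untested sketch of the in-memory option (the locking is only needed because a writer keeps appending while readers pull ranges):

    using System.Collections.Generic;

    static class InMemoryStore
    {
        // Index into the list == file number; the single writer appends, readers copy ranges out.
        static readonly List<byte[]> Files = new List<byte[]>();
        static readonly object Gate = new object();

        public static void Append(byte[] data)
        {
            lock (Gate) Files.Add(data);
        }

        // Returns files [from, from + count), ready to be written to the network.
        public static byte[][] GetRange(int from, int count)
        {
            lock (Gate) return Files.GetRange(from, count).ToArray();
        }
    }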

Foxfire
Yeah, this actually struck me as a viable alternative; going to local memory should speed things up considerably. The reason I discarded that option at first was that the original data was supposed to be 2 GB of files and not 200 MB (it will grow to a max of 400 MB or so, about 200k files), but I might be able to cram it into memory.
thr
Memory it is, tested it and it works perfectly :)
thr
+2  A: 

That sounds like a reasonable option.

When reading the data for the range, I'd be quite tempted to seek to the start of the "block of data" and read the whole lot into memory (i.e. the 2048-byte buffers for all the files) in one go. That will keep the file I/O to a minimum.

Once you've got all the data in memory, you can decode the sizes and send just the bits which are real data.
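
Something along these lines (an untested sketch, using the 2048-byte blocks and two-byte little-endian size prefix from the question):

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class RangeReader
    {
        const int BlockSize = 2048;

        // Reads files [first, first + count) with a single seek + read, then slices
        // the real payloads out of the fixed 2048-byte blocks in memory.
        public static List<byte[]> ReadRange(FileStream bigFile, long first, int count)
        {
            var raw = new byte[count * BlockSize];
            bigFile.Seek(first * BlockSize, SeekOrigin.Begin);

            int read = 0;
            while (read < raw.Length)
            {
                int n = bigFile.Read(raw, read, raw.Length - read);
                if (n == 0) throw new EndOfStreamException();
                read += n;
            }

            var files = new List<byte[]>(count);
            for (int i = 0; i < count; i++)
            {
                int offset = i * BlockSize;
                int length = raw[offset] | (raw[offset + 1] << 8);  // two-byte size prefix
                var payload = new byte[length];
                Array.Copy(raw, offset + 2, payload, 0, length);
                files.Add(payload);
            }
            return files;
        }
    }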

Loading all of it into memory may well be a good idea, but that will entirely depend on how often it's modified and how often it's queried.

Was there anything more to the question than just "is this a sane thing to do"?

Jon Skeet
Ah yeah, good point about reading the whole range into memory and then decoding it into the proper pieces. Well, my question was really whether there's any faster way to do this than what I've come up with.
thr
+1  A: 

Are you sure you will never want to delete files from, say, 1200 to 1400? What happens when you are done transferring? Is the data archived or will it continuously grow?

I really don't see why appending all of the data to a single file would improve performance. Instead it's likely to cause more issues for you down the line. So, why would you combine them?

Other things to consider are, what happens if the massive file gets some corruption in the middle from bad sectors on the disk? Looks like you lose everything. Keeping them separate should increase their survivability.

You can certainly work with large files without loading the entire thing in memory, but that's not exactly easy and you will ultimately have to drop down to some low level coding to do it. Don't constrain yourself. Also, what if the file requires a bit of hand editing? Most programs would force you to load and lock the entire thing.

Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability.

If you know you need files #1000 to 1100, you can use the built-in (C#) code to get a collection of files meeting that criterion.
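
For example (an untested sketch; it assumes each file is named by its zero-padded index, which is just an illustrative convention):

    using System.IO;
    using System.Linq;

    static class FileRange
    {
        // Reads files #from through #to from a directory of individually stored files,
        // assuming each one is named by its zero-padded index, e.g. "00001000.dat".
        public static byte[][] Read(string directory, int from, int to)
        {
            return Enumerable.Range(from, to - from + 1)
                .Select(i => Path.Combine(directory, i.ToString("D8") + ".dat"))
                .Select(File.ReadAllBytes)
                .ToArray();
        }
    }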

Chris Lively
Good points, but I'm rather sure about the constraints: files will never be deleted, only appended. This: "Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability." isn't true though; you can't have multiple writers (or well, you can, but it's really hard to get right), but you can have multiple readers and one writer as long as the writer only appends.
thr
A: 

Your plan sounds workable. It seems like a FileStream can perform the seeks and reads that you need. Are you running into specific problems with the implementation, or are you looking for a better way to do it?

Whether there is a better way might depend on how fast you can read the files vs. how fast you can transmit them over the network. Assuming that you can read tons of individual files faster than you can send them, perhaps you could set up a bounded buffer, where you read ahead X number of files into a queue. Another thread would then read from the queue and send them over the network.
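
A rough, untested sketch of that setup using a bounded BlockingCollection (the capacity of 100 and the two helper methods are placeholders):

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    static class RangeSender
    {
        // Reader task fills a bounded queue (and blocks once 100 files are buffered);
        // sender task drains it and pushes the data onto the network.
        public static void SendRange(int from, int to)
        {
            var queue = new BlockingCollection<byte[]>(boundedCapacity: 100);

            var readerTask = Task.Run(() =>
            {
                foreach (byte[] file in ReadFilesInRange(from, to))  // placeholder read method
                    queue.Add(file);
                queue.CompleteAdding();
            });

            var senderTask = Task.Run(() =>
            {
                foreach (byte[] file in queue.GetConsumingEnumerable())
                    SendOverNetwork(file);                           // placeholder send method
            });

            Task.WaitAll(readerTask, senderTask);
        }

        // Placeholders for whatever file reading / socket code is actually used.
        static IEnumerable<byte[]> ReadFilesInRange(int from, int to) { yield break; }
        static void SendOverNetwork(byte[] data) { }
    }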

JMarsch
Good points, I will evaluate this. Right now I'm leaning towards putting them all in memory though; it should be fastest, and I don't have to go through the hassle of keeping "The File" in check.
thr
A: 

I would modify your scheme in one way: instead of reading the first two bytes, then using those to determine the size of the next read, I'd just read 2KiB immediately, then use the first two bytes to determine how many bytes you transmit.

You'll probably save more time by using only one disk read than by avoiding transferring the last ~150 bytes from the disk into memory.
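
An untested sketch of that single-read-per-entry variant, using the layout from the question:

    using System;
    using System.IO;

    static class WholeBlockReader
    {
        const int BlockSize = 2048;

        // One disk read per entry: grab the whole 2 KiB block, then hand back only
        // the bytes named by the two-byte size prefix (no second read, no extra copy).
        public static ArraySegment<byte> ReadEntry(FileStream bigFile, long x)
        {
            var block = new byte[BlockSize];
            bigFile.Seek(x * BlockSize, SeekOrigin.Begin);

            int read = 0;
            while (read < BlockSize)
            {
                int n = bigFile.Read(block, read, BlockSize - read);
                if (n == 0) throw new EndOfStreamException();
                read += n;
            }

            int length = block[0] | (block[1] << 8);
            return new ArraySegment<byte>(block, 2, length);
        }
    }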

The other possibility would be to pack the data for the files together, and maintain a separate index to tell you the start position of each. For your situation, this has the advantage that instead of doing a lot of small (2K) reads from the disk, you can combine an arbitrary number into one large read. Getting up to around 64-128K per read will generally save a fair amount of time.

Jerry Coffin
Also a good idea. I think a combination of your idea and what Jon Skeet said would be best: if I need files 1000-1100, just read from 1000*2048 to 1100*2048 and then sort it out in memory when it's all there, using one whole disk read for all of it (as you suggested, but without the packing of files).
thr
A: 

You could stick with your solution of one big file but use memory mapping to access it (see here, e.g.). This might be a bit more performant, since you also avoid paging, and the virtual memory management is optimized for transferring chunks of 4096 bytes. AFAIK, there's no direct support for memory mapping, but here is an example of how to wrap the Win32 API calls from C#.

See also here for a related question on SO.
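
(Note: if a newer framework is an option, .NET 4.0 added direct support via System.IO.MemoryMappedFiles. A rough, untested sketch of reading one entry that way, with the layout from the question:)

    using System.IO.MemoryMappedFiles;

    static class MappedReader
    {
        const int BlockSize = 2048;

        // Reads entry x straight out of a memory-mapped view of the big file.
        // (In real code you'd keep the mapping open instead of re-creating it per read.)
        public static byte[] ReadEntry(string path, long x)   // path is illustrative
        {
            using (var mmf = MemoryMappedFile.CreateFromFile(path))
            using (var view = mmf.CreateViewAccessor(x * BlockSize, BlockSize))
            {
                int length = view.ReadUInt16(0);          // two-byte size prefix
                var payload = new byte[length];
                view.ReadArray(2, payload, 0, length);    // the payload itself
                return payload;
            }
        }
    }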

MartinStettner
A: 

You can simply concatenate all the files into one big file, 'dbase', without any header or footer.

In another file, 'index', you can save the position of each small file in 'dbase'. This index file, being very small, can be cached completely in memory.

This scheme lets you read the required files quickly and add new ones at the end of your collection.
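
An untested sketch of that layout (the file names and the fixed 8-byte-offset/4-byte-length index record format are just illustrative choices):

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class PackedStore
    {
        // Appends one small file to 'dbase' and records its (offset, length) both in the
        // in-memory list and in the 'index' file (8-byte offset + 4-byte length per entry).
        public static void Append(FileStream dbase, FileStream index,
                                  List<(long Offset, int Length)> offsets, byte[] data)
        {
            long offset = dbase.Seek(0, SeekOrigin.End);
            dbase.Write(data, 0, data.Length);

            index.Write(BitConverter.GetBytes(offset), 0, 8);
            index.Write(BitConverter.GetBytes(data.Length), 0, 4);

            offsets.Add((offset, data.Length));
        }

        // Reads entry i back using the cached in-memory index.
        public static byte[] Read(FileStream dbase, List<(long Offset, int Length)> offsets, int i)
        {
            var (offset, length) = offsets[i];
            var buffer = new byte[length];
            dbase.Seek(offset, SeekOrigin.Begin);

            int read = 0;
            while (read < length)
            {
                int n = dbase.Read(buffer, read, length - read);
                if (n == 0) throw new EndOfStreamException();
                read += n;
            }
            return buffer;
        }
    }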

am
A: 

Interestingly, this problem reminds me of this older SO question:

Is this an over-the-top question for Senior Java developer role?

quillbreaker