views:

416

answers:

8

I am going to store 350M pre-calculated double numbers in a binary file, and load them into memory as my dll starts up. Is there any built in way to load it up in parallel, or should I split the data into multiple files myself and take care of multiple threads myself too?

Answering the comments: I will be running this dll on powerful enough boxes, most likely only on 64 bit ones. Because all the access to my numbers will be via properties anyway, I can store my numbers in several arrays.

[update]

Everyone, thanks for answering! I'm looking forward to a lot of benchmarking on different boxes. Regarding the need: I want to speed up a very slow calculation, so I am going to pre-calculate a grid, load it into memory, and then interpolate.

+9  A: 

It sounds extremely unlikely that you'll actually be able to fit this into a contiguous array in memory, so presumably the way in which you parallelize the load depends on the actual data structure.

(Addendum: LukeH pointed out in comments that there is actually a hard 2GB limit on object size in the CLR. This is detailed in this other SO question.)

Assuming you're reading the whole thing from one disk, parallelizing the disk reads is probably a bad idea. If there's any processing you need to do to the numbers as or after you load them, you might want to consider running that in parallel at the same time you're reading from disk.

mquander
I can write a program which will hold that much data in a contiguous array in memory on my 5-year-old notebook. This isn't that unlikely.
liori
@lion: Just curious...how much memory do you have and at what point do you get the `OutOfMemoryException`? Mine cannot even come close to creating the 350M entry double array.
Brian Gideon
My 8GB-of-RAM Dell Precision workstation at work here can't do it.
mquander
@mquander: RAM is irrelevant. It's contiguous virtual address space that matters. RAM is just a hardware performance optimization; it has nothing whatsoever to do with "running out of memory". Memory is *address space*, not *RAM*.
Eric Lippert
liori: it needs 2.8GB RAM just for that array - are you serious that your 5-year-old notebook has more RAM than that and is 64-bit? That must have been a heck of a notebook at the time...
Peter
In the current Microsoft implementation of the CLR, available contiguous address space is pretty much irrelevant once your objects approach 2GB in size. You could have 1TB of available address space, but the max size of a single object will still be hard-limited to 2GB by the CLR, regardless of whether you're running on 32-bit or 64-bit. (Of course, it's reasonably easy to knock-up a composite type to work around this limitation, but in that case you don't necessarily require contiguous address space.)
LukeH
@Peter, @Brian Gideon: "Visual" proof: http://imgur.com/et5CY (note that it takes 3 secounds :-)). Well, I was mistaken, it is 4 year old, and it is Thinkpad X61 with 4GB of memory, C2D T7300, and Debian AMD64. I could actually get 400M... with 450M my OS started swapping heavily, but I guess it would work given enough time ;-) malloc() allows me to allocate 4GB with one call without any troubles, though.
liori
@liori: It seems that the Mono runtime doesn't impose the same arbitrary 2GB object size restriction that the Microsoft CLR does.
LukeH
@LukeH: oh, so there is a limit for MS CLR? Good to know :-)
liori
@Eric Lippert: Yes, I know, which is why I posted the answer without asking the fellow for his specs first -- my remark about RAM was just hammering the point home that having lots may not help. (However, I was not aware of the 2GB limit.)
mquander
liori: Okay, I'm just surprised, it seems like quite a hefty spec for a laptop of that age. Wish my laptop was that grunty :)
Peter
Peter: I bought it used a year ago and with only 2GB, which I expanded. These models are quite cheap now... It was an "upgrade" for my old Compaq TC1000, so you can imagine how fast it felt when I started using it ;-)
liori
+2  A: 

That does not sound like a good idea to me. 350,000,000 * 8 bytes = 2,800,000,000 bytes. Even if you manage to avoid the OutOfMemoryException the process may be swapping in/out of the page file anyway. You might as well leave the data in the file and load smaller chucks as they are needed. The point is that just because you can allocate that much memory does not mean you should.

Brian Gideon
You're making a lot of assumptions here about the machine that the OP will be using.
Martin Smith
@Martin: the OP has the responsibility to indicate what the hardware constraints are; without any information on that then it makes sense to reason conservatively. Frankly, this doesn't sound like a good idea to me either. If I were faced with 350 million floats on disk, I'd never try to read them all into memory at once. I'd read them in chunks as needed. This is a perfectly sensible idea.
Eric Lippert
@Martin: That's a good point especially in light of the edit to the question. I did edit my answer accordingly.
Brian Gideon
@Eric Lippert: If OP asked about an array with 1000 elements, would you require him to state that he doesn't write a program for 8051?
liori
@lion: I think I can say with absolute certainty that any machine capable of running .NET applications would be equally capable of allocating an 8KB array.
Brian Gideon
@Brian Gideon: He did not state that he'll be running .NET. Maybe that's Mono, maybe Portable.NET...
liori
+1  A: 

With a suitable disk configuration, splitting into multiple files across disks would make sense - and reading each file in a separate thread would then work nicely (if you've some stripyness - RAID whatever :) - then it could make sense to read from a single file with multiple threads).

I think you're on a hiding to nothing attempting this with a single physical disk, though.

Will A
+5  A: 

In typical case, loading speed will be limited by speed of storage you're loading data from--i.e. hard drive.

If you want it to be faster, you'll need to use faster storage, f.e. multiple hard drives joined in a RAID scheme.

If your data can be reasonably compressed, do that. Try to find algorithm which will use exactly as much CPU power as you have---less than that and your external storage speed will be limiting factor; more than that and your CPU speed will be limiting factor. If your compression algorithm can use multiple cores, then multithreading can be useful.

If your data are somehow predictable, you might want to come up with custom compression scheme. F.e. if consecutive numbers are close to each other, you might want to store differences between numbers---this might help compression efficiency.

Do you really need double precision? Maybe floats will do the job? Maybe you don't need full range of doubles? For example if you need full 53 bits of mantissa precision, but need only to store numbers between -1.0 and 1.0, you can try to chop few bits per number by not storing exponents in full range.

liori
+5  A: 

The first question you have presumably already answered is "does this have to be precalculated?". Is there some algorithm you can use that will make it possible to calculate the required values on demand to avoid this problem? Assuming not...

That is only 2.6GB of data - on a 64 bit processor you'll have no problem with a tiny amount of data like that. But if you're running on a 5 year old computer with a 10 year old OS then it's a non-starter, as that much data will immediately fill the available working set for a 32-bit application.

One approach that would be obvious in C++ would be to use a memory-mapped file. This makes the data appear to your application as if it is in RAM, but the OS actually pages bits of it in only as it is accessed, so very little real RAM is used. I'm not sure if you could do this directly from C#, but you could easily enough do it in C++/CLI and then access it from C#.

Alternatively, assuming the question "do you need all of it in RAM simultaneously" has been answered with "yes", then you can't go for any kind of virtualisation approach, so...

Loading in multiple threads won't help - you are going to be I/O bound, so you'll have n threads waiting for data (and asking the hard drive to seek between the chunks they are reading) rather than one thread waiitng for data (which is being read sequentially, with no seeks). So threads will just cause more seeking and thus may well make it slower. (The only case where splitting the data up might help is if you split it to different physical disks so different chunks of data can be read in parallel - don't do this in software; buy a RAID array)

The only place where multithreading may help is to make the load happen in the background while the rest of your application starts up, and allow the user to start using the portion of the data that is already loaded while the rest of the buffer fills, so the user (hopefully) doesn't have to wait much while the data is loading.

So, you're back to loading the data into one massive array in a single thread...

However, you may be able to speed this up considerably by compressing the data. There are a couple of general approaches woth considering:

  • If you know something about the data, you may be able to invent an encoding scheme that makes the data smaller (and therefore faster to load). e.g. if the values tend to be close to each other (e.g. imagine the data points that describe a sine wave - the values range from very small to very large, but each value is only ever a small increment from the last) you may be able to represent the 'deltas' in a float without losing the accuracy of the original double values, halving the data size. If there is any symmetry or repetition to the data you may be able to exploit it (e.g. imagine storing all the positions to describe a whole circle, versus storing one quadrant and using a bit of trivial and fast maths to reflect it 4 times - an easy way to quarter the amount of data I/O). Any reduction in data size would give a corresponding reduction in load time. In addition, many of these schemes would allow the data to remain "encoded" in RAM, so you'd use far less RAM but still be able to quickly fetch the data when it was needed.

  • Alternatively, you can very easily wrap your stream with a generic compression algorithm such as Deflate. This may not work, but usually the cost of decompressing the data on the CPU is less than the I/O time that you save by loading less source data, so the net result is that it loads significantly faster. And of course, save a load of disk space too.

Jason Williams
Compression sped up the loads dramatically, thanks!
AlexKuznetsov
@AlexKuznetsov: Cool. Glad it helped. :-)
Jason Williams
+3  A: 

Making this parallel would be a bad idea unless you're running on a SSD. The limiting factor is going to be the disk IO--and if you run two threads the head is going to be jumping back and forth between the two areas being read. This will slow it down a lot more than any possible speedup from parallelization.

Remember that drives are MECHANICAL devices and insanely slow compared to the processor. If you can do a million instructions in order to avoid a single head seek you will still come out ahead.

Also, once the file is on disk make sure to defrag the disk to ensure it's in one contiguous block.

Loren Pechtel
+9  A: 
CriGoT
I appreciate the suggestion, thanks!I did not go for it because just reading from a zipped file was fast enough, and was simpler.
AlexKuznetsov
Really Good to know and definitely it is best to "keep is simple"
CriGoT
A: 

Just saw this : .NET 4.0 has support for memory mapped files. That would be a very fast way to do it, and no support required for parallelization etc.

apoorv020