Hi there!

I'm having the following problem. I need to store huge amounts of information (~32 GB) and be able to manipulate it as fast as possible. I'm wondering what's the best way to do it (combination of programming language + OS + whatever else you think is important).

The structure of the information I'm using is a 4D array (NxNxNxN) of double-precision floats (8 bytes). Right now my solution is to slice the 4D array into 2D arrays and store them in separate files on the HDD of my computer. This is really slow and the manipulation of the data is unbearable, so this is no solution at all!

I'm thinking of moving to a supercomputing facility in my country and storing all the information in RAM, but I'm not sure how to implement an application that takes advantage of it (I'm not a professional programmer, so any book/reference will help me a lot).

An alternative solution I'm thinking of is to buy a dedicated server with lots of RAM, but I don't know for sure if that will solve the problem. So right now my ignorance doesn't let me choose the best way to proceed.

What would you do if you were in this situation? I'm open to any idea.

Thanks in advance!


EDIT: Sorry for not providing enough information, I'll try to be more specific.

I'm storing a discretized 4D mathematical function. The operations that I would like to perform include transposition of the array (change b[i,j,k,l] = a[j,i,k,l] and the like), array multiplication, etc.

As this is a simulation of a proposed experiment, the operations will be applied only once. Once the result is obtained, it won't be necessary to perform more operations on the data.


EDIT (2):

I would also like to be able to store more information in the future, so the solution should be somewhat scalable. The current 32 GB goal is because I want to have the array with N=256 points, but it would be better if I could use N=512 (which means 512 GB to store it!!).

+2  A: 

Any decent answer will depend on how you need to access the data. Random access? Sequential access?

32GB is not really that huge.

How often do you need to process your data? Once per (lifetime | year | day | hour | nanosecond)? Often, stuff only needs to be done once. This has a profound effect on how much you need to optimize your solution.

What kind of operations will you be performing (you mention multiplication)? Can the data be split up into chunks, such that all necessary data for a set of operations is contained in a chunk? This will make splitting it up for parallel execution easier.

Most computers you buy these days have enough RAM to hold your 32GB in memory. You won't need a supercomputer just for that.

Daren Thomas
I very much doubt there are any home computers that can hold 32 GB in memory. But I get your point: opening a 32 GB file will not overload the computer if it is a 64-bit OS. A lot of that 32 GB will be held in swap space.
thecoshman
@thecoshman: yep. I goofed. But fast forward 5 years and 32GB will be *lame* :) I kinda got confused, after working out that he needed roughly 256GB - and thought, *wait-a-minute*, 256? I have 3TB in my laptop. But that is like 3 orders of magnitude from the truth...
Daren Thomas
There are. A normal home computer these days can hold 16 GB in memory. Add virtual memory (fast disk, SSD) and handling 32 GB "just in memory" is actually feasible if it doesn't happen too often. Server boards easily hold 128 GB these days, but they are a lot more expensive.
TomTom
+3  A: 

As Chris pointed out, it depends on what you are going to do with the data.

Besides, I think storing it in a (relational) database will be faster than reading it from the hard drive, since the RDBMS will perform some optimizations for you, like caching.

Henri
I don't know of any modern OSes that don't cache the HD.
Steven Sudit
It's a 4-dimensional array of floats; using an RDBMS just seems suboptimal when you have 4 known indices.
Stephen
+1  A: 

Without more information: if you need the quickest possible access to all the data, I would go with C for your programming language, some flavor of *nix as the OS, and buying RAM; it's relatively cheap now. This also depends on what you are familiar with; you can go the Windows route as well. But as others have mentioned, it will depend on how you are using this data.

David
A: 

Database technology as a whole is about manipulating huge amounts of data that can't fit in RAM, so that might be your starting point (i.e. get a good DBMS principles book and read about indexing, query execution, etc.).
A lot depends on how you need to access the data: if you absolutely need to jump around and access random bits of information, you're in trouble, but perhaps you can structure your processing of the data such that you scan it along one axis (dimension). Then you can use a smaller buffer and continuously dump already-processed data and read new data.
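For instance, here is a minimal sketch in C of that streaming idea, assuming the whole array lives in one flat binary file of doubles in row-major order (the file names, the value of N and the placeholder processing step are all just illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 256  /* assumed grid size; N^4 doubles = 32 GB in total */

    int main(void)
    {
        /* One slab a[i][*][*][*] is N*N*N doubles = 128 MB for N = 256,
           so the working set stays small even though the file is 32 GB. */
        size_t slab = (size_t)N * N * N;
        double *buf = malloc(slab * sizeof *buf);
        if (!buf) { perror("malloc"); return 1; }

        FILE *in  = fopen("a.bin", "rb");   /* hypothetical input file  */
        FILE *out = fopen("b.bin", "wb");   /* hypothetical output file */
        if (!in || !out) { perror("fopen"); return 1; }

        for (int i = 0; i < N; i++) {
            if (fread(buf, sizeof *buf, slab, in) != slab) { perror("fread"); return 1; }

            /* process the slab in place; a trivial scaling stands in for the real math */
            for (size_t m = 0; m < slab; m++)
                buf[m] *= 2.0;

            if (fwrite(buf, sizeof *buf, slab, out) != slab) { perror("fwrite"); return 1; }
        }

        free(buf);
        fclose(in);
        fclose(out);
        return 0;
    }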

Marcin K
+2  A: 

Amazon's "High Memory Extra Large Instance" is only $1.20/hr and has 34 GB of memory. You might find it useful, assuming you're not running this program constantly.

Brendan Long
32 GB is the smallest amount I will use; I'd love to find a scalable solution so I'll be able to use more memory in the future (256 GB?).
Alejandro Cámara
If you look at the second page I linked to, Amazon goes up to 68 GB of memory while still being reasonably priced (again, as long as you're not using it very often).
Brendan Long
There's something to be said for renting incredibly capable hardware rather than spending people time trying to optimize the problem. Ten hours of $1.20/hour is $12, which is much less than the minimum consulting fee anybody competent in the field is going to charge.
David Thornley
Especially when the runs are rare (i.e. once per month or so).
TomTom
+2  A: 

If you can represent your problem as MapReduce, consider a clustering system optimized for disk access, such as Hadoop.

Your description sounds more math-intensive, in which case you probably want to have all your data in memory at once. 32 GB of RAM in a single machine is not unreasonable; Amazon EC2 offers virtual servers with up to 68 GB.

Stuart Sierra
A: 

The first thing that I'd recommend is picking an object-oriented language and developing or finding a class that lets you manipulate a 4-D array without concern for how it's actually implemented.

The actual implementation of this class would probably use memory-mapped files, simply because that can scale from low-power development machines up to the actual machine where you want to run production code (I'm assuming that you'll want to run this many times, so that performance is important -- if you can let it run overnight, then a consumer PC may be sufficient).
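As a rough illustration of that separation (in C rather than an object-oriented language, since C comes up elsewhere in this thread; every name below is made up), the interface can be as small as get/set with four indices, and the flat pointer behind it could equally come from malloc on a big machine or from a memory-mapped file on a small one:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical 4-D array handle: callers never see how the data is backed. */
    typedef struct {
        double *data;            /* flat row-major storage [n0][n1][n2][n3] */
        size_t  n0, n1, n2, n3;
    } Array4D;

    size_t a4_index(const Array4D *a, size_t i, size_t j, size_t k, size_t l)
    {
        return ((i * a->n1 + j) * a->n2 + k) * a->n3 + l;
    }

    double a4_get(const Array4D *a, size_t i, size_t j, size_t k, size_t l)
    {
        return a->data[a4_index(a, i, j, k, l)];
    }

    void a4_set(Array4D *a, size_t i, size_t j, size_t k, size_t l, double v)
    {
        a->data[a4_index(a, i, j, k, l)] = v;
    }

    int main(void)
    {
        Array4D a = { .n0 = 4, .n1 = 4, .n2 = 4, .n3 = 4 };  /* tiny demo size */
        a.data = calloc(a.n0 * a.n1 * a.n2 * a.n3, sizeof *a.data);
        a4_set(&a, 1, 2, 3, 0, 42.0);
        printf("%f\n", a4_get(&a, 1, 2, 3, 0));
        free(a.data);
        return 0;
    }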

Finally, once I had my algorithms and data debugged, I would look into buying time on a machine that could hold all the data in memory. Amazon EC2, for instance, will provide you with a machine that has 68 GB of memory for $US 2.40 an hour (less if you play with spot instances).

Anon
Actually, no. The compiled/interpreted angle is probably irrelevant for the task in question (check out NumPy et al.).
Daren Thomas
@Daren: Right - and when the big delay is getting things onto and off disks, some extra CPU time may not matter at all.
David Thornley
@Daren Thomas - actually, yes. While NumPy might be a very good choice for a pre-written array manipulation tool (and should be an answer on its own), you'll note that parts of it are written in C: http://sourceforge.net/projects/numpy/develop
Anon
@David Thornley - agreed that moving data into and out of disk is going to be the big issue, which is why I presented a solution that will let the OS do the work.
Anon
@Anon: Except that the OS won't be smart about predicting what values will be accessed next, and will be moving the array in and out of memory as best it can. For some operations, this is just peachy. For others, it will cause thrashing.
David Thornley
@David Thornley - sure, but the same thing could be said about any solution where you don't have enough RAM to hold all the data. Which again suggests a multi-dimensional array class that makes some intelligent choices about operation implementation.
Anon
And since the issue of compiled versus interpreted seems to be keeping people from thinking about the rest of my answer, I'll delete that. Who knows, maybe the next comment will take issue with my "object-oriented" statement and suggest that functional is more important.
Anon
A: 

For transpositions, it's faster to actually just change your understanding of what index is what. By that, I mean you leave the data where it is and instead wrap an accessor delegate that changes b[i][j][k][l] into a request to fetch (or update) a[j][i][k][l].
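For example, a small sketch in C of such an accessor (the names and the tiny N are purely illustrative): "transposing" only swaps entries in a 4-entry permutation table, and the data never moves.

    #include <stdio.h>

    #define N 4  /* tiny size for illustration; the real array would use N = 256 */

    /* View onto a flat row-major NxNxNxN array: perm[d] says which physical
       axis logical axis d maps to, so a transpose is just a swap in perm. */
    typedef struct {
        double *data;
        int perm[4];
    } View4D;

    double v4_get(const View4D *v, const size_t idx[4])
    {
        size_t p[4];
        for (int d = 0; d < 4; d++)
            p[v->perm[d]] = idx[d];      /* route logical index d to its physical axis */
        return v->data[((p[0] * N + p[1]) * N + p[2]) * N + p[3]];
    }

    int main(void)
    {
        static double a[N * N * N * N];
        a[((2 * N + 1) * N + 0) * N + 3] = 7.0;   /* a[2][1][0][3] = 7 */

        View4D b = { a, {1, 0, 2, 3} };           /* logical b[i][j][k][l] = a[j][i][k][l] */
        size_t idx[4] = {1, 2, 0, 3};
        printf("%f\n", v4_get(&b, idx));          /* prints 7.000000 */
        return 0;
    }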

Donal Fellows
@Donal: That's right, but if I slice the array into 2D arrays, say a34[i,j] = a[i,j,3,4] and b25[k,l] = a[2,5,k,l], and then I need to operate over a[i,5,k,4], then I have to access **all** the files to obtain the equivalent information contained in just one file. That's what slows down the operations.
Alejandro Cámara
A: 

Would it be possible to solve it with this procedure?

First, create M child processes and execute them in parallel. Each process will run on a dedicated core of a cluster and will load part of the array into that core's RAM.

A parent process will be the manager of the array, calling (or connecting to) the appropriate child process to obtain certain chunks of data.

Will this be faster than the HDD storage approach? Or am I cracking nuts with a sledgehammer?
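To make the idea concrete, something like this rough MPI sketch is what I have in mind (every name and size is a placeholder, and it uses one-sided MPI access instead of an explicit parent process, but the layout is the same: each process keeps its slabs in RAM and the others fetch values from it directly):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 8   /* tiny size for illustration; the real problem uses N = 256 */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each rank owns the i-slabs with i % nprocs == rank and keeps them in RAM. */
        size_t slab = (size_t)N * N * N;
        int nlocal = 0;
        for (int i = rank; i < N; i += nprocs) nlocal++;
        double *local = malloc((size_t)nlocal * slab * sizeof(double));

        /* Fill the local slabs (the real code would read them from disk once). */
        for (size_t m = 0; m < (size_t)nlocal * slab; m++)
            local[m] = rank + m * 1e-9;

        /* Expose the local slabs so any process can fetch values directly. */
        MPI_Win win;
        MPI_Win_create(local, (MPI_Aint)((size_t)nlocal * slab * sizeof(double)),
                       sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Example: fetch a[3][1][2][4] from whichever rank owns slab i = 3. */
        int i = 3, j = 1, k = 2, l = 4;
        int owner = i % nprocs;
        MPI_Aint disp = (MPI_Aint)((size_t)(i / nprocs) * slab
                                   + ((size_t)j * N + k) * N + l);
        double val;
        MPI_Win_fence(0, win);
        MPI_Get(&val, 1, MPI_DOUBLE, owner, disp, 1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        if (rank == 0)
            printf("a[3][1][2][4] = %g (owned by rank %d)\n", val, owner);

        MPI_Win_free(&win);
        free(local);
        MPI_Finalize();
        return 0;
    }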

Alejandro Cámara
As long as they're all accessing different hard drives, this could work.. If it's all the same hard drive, it will just slow you down.
Brendan Long
@Brendan: The idea is to make the cores load the information into RAM, so there's no HDD access during the mathematical operations. Anyway, the cores are independent, so there would be no problem with it.
Alejandro Cámara
+1  A: 

So far, there are a lot of very different answers. There are two good starting points mentioned above. David suggests some hardware and someone mentioned learning C. Both of these are good points.

C is going to get you what you need in terms of speed and direct memory paging. The last thing you want to do is perform linear searches on the data. That would be slow - slow - slow.

Determine your workflow: if your workflow is linear, that is one thing. If it is not linear, I would design a binary tree referencing pages in memory. There is a ton of information on B-trees on the Internet. In addition, these B-trees will be much easier to work with in C, since you will also be able to set up and manipulate your memory paging.

dboarman
I'm not quite sure about everything you have said (I'm not quite the programmer type), but the bits I understood sound reasonable. The hardware route means spending some money on: 1) an SSD drive, 2) computing time. It is a good patch, but not the scalable solution I'm looking for. The other way is to learn C to be able to randomly access files and increase speed compared to the current linear way of accessing them. This sounds more like it, but it means I have to spend time learning (which doesn't necessarily have to be a bad thing :)
Alejandro Cámara
+2  A: 

Depending on your use, some mathematical and physical problems tend to be mostly zeros (for example, Finite Element models). If you expect that to be true for your data, you can get serious space savings by using a sparse matrix instead of actually storing all those zeros in memory or on disk.

Check out wikipedia for a description, and to decide if this might meet your needs: http://en.wikipedia.org/wiki/Sparse_matrix
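As an illustration only (names made up), a minimal coordinate-format (COO) store for a sparse 4-D array in C keeps each nonzero value together with its four indices, so memory grows with the number of nonzeros rather than with N^4:

    #include <stdio.h>
    #include <stdlib.h>

    /* Coordinate (COO) storage: only nonzero entries are kept. */
    typedef struct {
        int    *i, *j, *k, *l;   /* index lists             */
        double *val;             /* corresponding values    */
        size_t  nnz, cap;        /* stored count / capacity */
    } Sparse4D;

    void sp_push(Sparse4D *s, int i, int j, int k, int l, double v)
    {
        if (v == 0.0) return;                    /* zeros stay implicit */
        if (s->nnz == s->cap) {                  /* grow the arrays (no error checks in this sketch) */
            s->cap = s->cap ? 2 * s->cap : 16;
            s->i   = realloc(s->i,   s->cap * sizeof *s->i);
            s->j   = realloc(s->j,   s->cap * sizeof *s->j);
            s->k   = realloc(s->k,   s->cap * sizeof *s->k);
            s->l   = realloc(s->l,   s->cap * sizeof *s->l);
            s->val = realloc(s->val, s->cap * sizeof *s->val);
        }
        s->i[s->nnz] = i; s->j[s->nnz] = j; s->k[s->nnz] = k; s->l[s->nnz] = l;
        s->val[s->nnz++] = v;
    }

    int main(void)
    {
        Sparse4D s = {0};
        sp_push(&s, 0, 1, 2, 3, 4.5);
        sp_push(&s, 7, 7, 7, 7, 0.0);    /* ignored: it is a zero */
        printf("stored %zu nonzeros\n", s.nnz);
        return 0;
    }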

Henry Jackson
Could not be repeated enough.
johanbev
That's good advice, but not applicable in this case. In fact, I try to fit the data so that the number of null values is minimal!
Alejandro Cámara
+1  A: 

Here's another idea:

Try using an SSD to store your data. Since you're grabbing very small amounts of random data, an SSD would probably be much, much faster.

Brendan Long
That's something I thought about, but I'm not sure it will decrease the execution time enough. I did a test with N=256 on a regular HDD and it took approximately a month. I don't think it will take less than, what, two weeks? But it is still a good partial solution to combine with other ones (migrating to Python or, even better, C).
Alejandro Cámara
What I've read is that for random access, SSDs are much, much faster. This graph shows SSDs from Intel performing (I think -- it's hard to tell how badly the hard drives are performing) around 100x faster than hard drives: http://www.anandtech.com/show/2738/25 Even the most conservative ones are at least 20x faster.
Brendan Long
RealSSD has a disc doing 40k IOPS. This is HUGE for random access, like a factor of 100 to 1000. And they keep getting faster.
TomTom
A: 

How to handle processing large amounts of data typically revolves around the following factors:

  • Data access order / locality of reference: Can the data be separated into independent chunks that are then processed either independently or in a serial/sequential fashion, versus random access to the data with little or no order?

  • CPU vs I/O bound: Is the processing time spent more on computation with the data or reading/writing it from/to storage?

  • Processing frequency: Will the data be processed only once, every few weeks, daily, etc?

If the data access order is essentially random, you will need either to get access to as much RAM as possible and/or find a way to at least partially organize the order so that not as much of the data needs to be in memory at the same time. Virtual memory systems slow down very quickly once physical RAM limits are exceeded and significant swapping occurs. Resolving this aspect of your problem is probably the most critical issue.

Other than the data access order issue above, I don't think your problem has significant I/O concerns. Reading/writing 32 GB is usually measured in minutes on current computer systems, and even data sizes up to a terabyte should not take more than a few hours.

Programming language choice is actually not critical so long as it is a compiled language with a good optimizing compiler and decent native libraries: C++, C, C#, or Java are all reasonable choices. The most computationally and I/O-intensive software I've worked on has actually been in Java and deployed on high-performance supercomputing clusters with a few thousand CPU cores.

Joel Hoff
+1  A: 

You may want to try using mmap instead of reading the data into memory, but I'm not sure it'll work with 32 GB files.
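For what it's worth, a rough sketch of what that looks like on a 64-bit *nix system (the file name is a placeholder); the size is only a problem for 32-bit address spaces, since the mapping itself is lazy and pages are pulled in from disk as they are touched:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define N 256  /* 256^4 doubles = 32 GB; needs a 64-bit OS and build */

    int main(void)
    {
        int fd = open("a.bin", O_RDONLY);          /* hypothetical data file */
        if (fd < 0) { perror("open"); return 1; }

        size_t len = (size_t)N * N * N * N * sizeof(double);
        double *a = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (a == MAP_FAILED) { perror("mmap"); return 1; }

        /* The whole 4-D array is now addressable as ordinary memory. */
        size_t i = 1, j = 2, k = 3, l = 4;
        printf("%f\n", a[((i * N + j) * N + k) * N + l]);

        munmap(a, len);
        close(fd);
        return 0;
    }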

lhf