views: 325
answers: 4

OK, so I am writing a program that unfortunately needs to use a huge data structure to complete its work, but it is failing with an "out of memory" error during its initialization. While I understand entirely what that means and why it is a problem, I am having trouble overcoming it, since my program needs to use this large structure and I don't know any other way to store it.

The program first indexes a large corpus of text files that I provide. This works fine.

Then it uses this index to initialize a large 2D array. This array will have n² entries, where "n" is the number of unique words in the corpus of text. For the relatively small chunk I am testing it on (about 60 files), it needs to make approximately 30,000 x 30,000 entries. This will probably be bigger once I run it on my full intended corpus, too.

It consistently fails every time, after it indexes, while it is initializing the data structure (to be worked on later).

Things I have done include:

  • revamped my code to use a primitive int[] instead of a TreeMap
  • eliminated redundant structures, etc.
  • run the program with -Xmx2g to max out my allocated memory

I am fairly confident this is not going to be solved by a simple one-line fix, but will most likely require a whole new approach. I am looking for what that approach is; any ideas?

Thanks, B.

A: 

If you don't need a full 32 bits (the size of an int) for each value in your 2D array, perhaps a smaller type such as a byte would do the trick? Also, you should give it as much heap space as possible - 2GB is still relatively small for a modern system. RAM is cheap, especially if you're expecting to be doing a lot of processing in-memory.
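As a rough illustration of the difference, a minimal sketch (the class name and the figures are assumptions based on the 30,000-word vocabulary mentioned in the question):

    public class MatrixSizeSketch {
        public static void main(String[] args) {
            int n = 30000; // approximate vocabulary size from the question

            // int cells: n * n * 4 bytes is roughly 3.4 GB -- more than a 2 GB heap
            // int[][] counts = new int[n][n];

            // byte cells: n * n * 1 byte is roughly 0.9 GB -- may fit, but only if
            // no cell's count ever needs to exceed Byte.MAX_VALUE (127)
            byte[][] counts = new byte[n][n];
            System.out.println("Allocated " + ((long) n * n) + " one-byte cells");
        }
    }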

Marc Novakowski
+2  A: 

It sounds like (making some assumptions about what you're using your array for) most of the entries will be 0. If so, you might consider using a sparse matrix representation.

If you really have that many entries (your current array is somewhere over 3 gigabytes already, even assuming no overhead), then you'll have to use some kind of on-disk storage, or a lazy-load/unload system.
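A minimal sketch of one possible sparse representation in Java (the class and method names are illustrative, not something from the question):

    import java.util.HashMap;
    import java.util.Map;

    // Stores only the non-zero cells of an n x n co-occurrence matrix.
    // Row and column indices are packed into a single long key.
    public class SparseMatrix {
        private final Map<Long, Integer> cells = new HashMap<Long, Integer>();

        private static long key(int row, int col) {
            return ((long) row << 32) | (col & 0xFFFFFFFFL);
        }

        public void increment(int row, int col) {
            Long k = key(row, col);
            Integer current = cells.get(k);
            cells.put(k, current == null ? 1 : current + 1);
        }

        public int get(int row, int col) {
            Integer value = cells.get(key(row, col));
            return value == null ? 0 : value;
        }
    }

Each stored cell carries boxing and hash-map overhead, so this only pays off when the vast majority of cells stay zero; a dedicated sparse-matrix library would be leaner still.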

Porges
+1 It sounds like the OP is attempting to create a naive concordance. Although each text may have a 30k word vocabulary, there will be a vast number of zeros in that matrix.
msw
Yup - a sparse array is what came to my mind, but without more details on the intent of the data structure, it's difficult to do more than speculate.
Michael Burr
I like the sound of this, I'll give it a shot. @msw, you're close, it's similar to a concordance (and very naive ;)). It's a different statistical analysis approach called Hyperspace Analogue to Language (HAL).
gnomed
This is working for me so far. Now I just have to make it run faster.
gnomed
Cool, glad that it works :)
Porges
+1  A: 

This is a common problem when dealing with large datasets. You can optimize as much as you want, but memory will probably never be enough, and as soon as the dataset grows a little more you are smoked again. The most scalable solution is simply to keep less in memory, work on chunks, and persist the structure on disk (a database or file).
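As an illustration of working on chunks, one way a row-at-a-time, file-backed matrix might look (all names and the file layout are assumptions, not taken from the question):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Keeps the full n x n int matrix in a file (row-major, 4 bytes per cell)
    // and only ever loads one row into memory at a time.
    public class RowBackedMatrix {
        private final RandomAccessFile file;
        private final int n;

        public RowBackedMatrix(String path, int n) throws IOException {
            this.file = new RandomAccessFile(path, "rw");
            this.n = n;
            this.file.setLength((long) n * n * 4); // reserve space up front
        }

        public int[] readRow(int row) throws IOException {
            int[] values = new int[n];
            file.seek((long) row * n * 4);
            for (int col = 0; col < n; col++) {
                values[col] = file.readInt();
            }
            return values;
        }

        public void writeRow(int row, int[] values) throws IOException {
            file.seek((long) row * n * 4);
            for (int col = 0; col < n; col++) {
                file.writeInt(values[col]);
            }
        }
    }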

crunchdog
+2  A: 

There are several possible causes of out-of-memory issues.

First, the simplest case is that you simply need more heap. You're using a 512M max heap when your program could operate correctly with 2G. Increase it with -Xmx2048m as a JVM option and you're fine. Also be aware that 64-bit VMs will use up to twice the memory of 32-bit VMs, depending on the makeup of the data.
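For example (the program name here is just a placeholder):

    java -Xmx2048m -jar your-program.jar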

If your problem isn't that simple, then you can look at optimization: replacing objects with primitives and so on. This might be an option; I can't really say based on what you've posted.

Ultimately, however, you come to a crossroads where you have to make a choice between virtualization and partitioning.

Virtualizing in this context simply means some form of pretending there is more memory than there actually is. Operating systems do this with virtual address spaces, using hard disk space as extra memory. Here it could mean keeping only part of the data structure in memory at a time and persisting the rest to secondary storage (e.g. a file or database).
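A minimal sketch of the "keep only some of it in memory" idea, assuming the matrix is accessed a row at a time (the class name is illustrative and the write-back to disk is left out):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Caches the most recently used rows and evicts the oldest ones.
    // An evicted row would be persisted to disk before being dropped.
    public class RowCache extends LinkedHashMap<Integer, int[]> {
        private final int maxRows;

        public RowCache(int maxRows) {
            super(16, 0.75f, true); // access-order, so the least recently used row goes first
            this.maxRows = maxRows;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, int[]> eldest) {
            return size() > maxRows; // persist eldest.getValue() here before it is dropped
        }
    }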

Partitioning is splitting your data across multiple servers (either real or virtual). For example, if you were keeping track of stock trades on the NASDAQ you could put stock codes starting with "A" on server1, "B" on server2, etc. You need to find a reasonable approach to slice your data such that you reduce or eliminate the need for cross-communication because that cross-communication is what limits your scalability.

So, simple case: if what you're storing is 30K words and 30K x 30K combinations of words, you could divide it up into four servers (a small routing sketch follows the list):

  • A-M x A-M
  • A-M x N-Z
  • N-Z x A-M
  • N-Z x N-Z
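A minimal sketch of how a word pair might be routed to one of those four partitions (the split and names are purely illustrative):

    // Maps a pair of words to a partition index 0..3:
    // (A-M x A-M), (A-M x N-Z), (N-Z x A-M), (N-Z x N-Z).
    public class PartitionRouter {
        private static int half(String word) {
            char first = Character.toUpperCase(word.charAt(0));
            return first <= 'M' ? 0 : 1; // 0 = A-M, 1 = N-Z
        }

        public static int partitionFor(String rowWord, String colWord) {
            return half(rowWord) * 2 + half(colWord);
        }
    }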

That's just one idea. Again, it's hard to comment without knowing specifics.

cletus
You've mentioned pretty much everything I've tried already, as I noted in my post (which was sparsely detailed, I admit). Your latter solutions get at my problem, but I am not looking to do anything too complicated (partitioning sounds beyond my scope).
gnomed