views: 380
answers: 3

I am using an NLP library (Stanford NER) that throws OOM errors for rare input documents.

I plan to eventually isolate these documents and figure out what about them causes the errors, but this is hard to do (I'm running in Hadoop, so I just know the error occurs 17% through split 379/500 or something like that). As an interim solution, I'd like to be able to apply a CPU and memory limit to this particular call.

I'm not sure what the best way to do this would be. My first thought is to create a fixed thread pool of one thread and use the timed get() on Future (rough sketch below). This would at least give me a wall-clock limit, which would likely help somewhat.
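Roughly what I have in mind (a sketch only; tagDocument() is a stand-in for the actual NER call):

import java.util.concurrent.*;

public class BoundedNerCall {
    private static final ExecutorService executor = Executors.newFixedThreadPool(1);

    static String runWithTimeout(final String document, long timeoutSeconds) throws Exception {
        Future<String> future = executor.submit(new Callable<String>() {
            public String call() throws Exception {
                return tagDocument(document);
            }
        });
        try {
            // Wall-clock limit only: this does not constrain memory use.
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Best-effort cancel; only effective if the tagger checks interrupts.
            future.cancel(true);
            return null; // or log and skip the document
        }
    }

    static String tagDocument(String document) {
        return document; // placeholder for the real NER invocation
    }
}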

My question is whether there is any way to do better than this with a reasonable amount of effort.

+2  A: 

I'm not familiar with Hadoop, but don't forget that your JVM will have an implicit upper memory limit imposed on it (64MB for a server JVM, if my memory is correct). I would check what memory configuration your JVM is running with (see the JVM options documentation).

You can override this by specifying the upper memory limit thus:

java -Xmx512m

to set the limit to, say, 512MB.

Setting a CPU limit is outside the remit of the JVM; it would require an OS-specific mechanism (if it can be done at all).

If you're dispatching these jobs in parallel from a single JVM, then running a single-thread (or otherwise limited) thread pool may well help you. However, this again depends on your implementation, and more details would be needed.

Brian Agnew
A: 

If all you're trying to do is figure out which documents are crashing, you should put logging around the call to the NLP library (e.g. "about to map document x"). When you see the OOM, the logs for that mapper will end with the document of doom. As you said, you can then work out what characteristics of that document cause the library to crash.
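Something along these lines (just a sketch using the new-API Mapper; tag() is a placeholder for the real library call):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NerMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Log *before* the risky call, so the last line in the task log
        // points at the document that blew up.
        System.err.println("about to map document " + key);
        String tagged = tag(value.toString());
        context.write(new Text(key.toString()), new Text(tagged));
    }

    // Placeholder for the real NER invocation.
    private String tag(String document) {
        return document;
    }
}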

In my experience, especially if the documents were created by people on the Internet, you will find some crazy huge document somewhere. At that point you have to decide what to do with such documents: ignore them entirely, or perhaps truncate them.

+1  A: 

Just catch the OutOfMemoryError, log which document you were on, then move on to the next one. The garbage collector will make sure you have enough memory for the next document.

(This is one of the strategies I use with the Stanford dependency parser to move on to the next sentence if one sentence is too long or convoluted to parse.)
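In code, it looks something like this (just a sketch; tagDocument() is a placeholder for the library call):

import java.util.ArrayList;
import java.util.List;

public class SkipOomDocuments {
    static List<String> tagAll(List<String> documents) {
        List<String> results = new ArrayList<String>();
        for (int i = 0; i < documents.size(); i++) {
            try {
                results.add(tagDocument(documents.get(i)));
            } catch (OutOfMemoryError e) {
                // Log which document failed and move on; the failed call's
                // data becomes unreachable, so the GC reclaims that memory.
                System.err.println("OOM while tagging document " + i + "; skipping it");
            }
        }
        return results;
    }

    static String tagDocument(String document) {
        return document; // placeholder for the real NER invocation
    }
}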

Ken Bloom