views: 380
answers: 3

I am using an NLP library (Stanford NER) that throws OOM errors for rare input documents.

I plan to eventually isolate these documents and figure out what about them causes the errors, but this is hard to do (I'm running in Hadoop, so I just know the error occurs 17% through split 379/500 or something like that). As an interim solution, I'd like to be able to apply a CPU and memory limit to this particular call.

I'm not sure what the best way to do this would be. My first thought is to create a fixed thread pool of one thread and use the timed get() on Future (rough sketch below). This would at least give me a wall-clock limit, which would likely help somewhat.
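Roughly what I have in mind (a sketch only; tagDocument() is a stand-in for the actual NER call):

import java.util.concurrent.*;

public class BoundedNerCall {
    private static final ExecutorService executor = Executors.newFixedThreadPool(1);

    static String runWithTimeout(final String document, long timeoutSeconds) throws Exception {
        Future<String> future = executor.submit(new Callable<String>() {
            public String call() throws Exception {
                return tagDocument(document);
            }
        });
        try {
            // Wall-clock limit only: this does not constrain memory use.
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Best-effort cancel; only effective if the tagger checks interrupts.
            future.cancel(true);
            return null; // or log and skip the document
        }
    }

    static String tagDocument(String document) {
        return document; // placeholder for the real NER invocation
    }
}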

My question is whether there is any way to do better than this with a reasonable amount of effort.

+2  A: 

I'm not familiar with Hadoop, but don't forget that your JVM will have an implicit upper memory limit imposed on it (64MB for a server JVM, if my memory is correct). I would check what memory configuration your JVM is running with (see the JVM options documentation).

You can override this by specifying the upper memory limit thus:

java -Xmx512m

to set the limit to, say, 512MB.

Setting a CPU limit is outside the remit of the JVM; it would require an OS-specific mechanism (if it can be done at all).

If you're dispatching these jobs in parallel from a single JVM, then running a single-thread (or otherwise limited) thread pool may well help you. However, this again depends on your implementation, and more details would be needed.

Brian Agnew
A: 

If all you're trying to do is figure out which documents are crashing, you should put logging around the call to the NLP library (e.g. "about to map document x"). When you see the OOM, the logs for that mapper will end with the document of doom. As you said, you can then work out what characteristics of that document cause the library to crash.
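Something along these lines (just a sketch using the new-API Mapper; tag() is a placeholder for the real library call):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NerMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Log *before* the risky call, so the last line in the task log
        // points at the document that blew up.
        System.err.println("about to map document " + key);
        String tagged = tag(value.toString());
        context.write(new Text(key.toString()), new Text(tagged));
    }

    // Placeholder for the real NER invocation.
    private String tag(String document) {
        return document;
    }
}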

In my experience, especially if the documents were created by people on the Internet, you will find some crazy huge document somewhere. At that point you have to decide what to do with such documents: ignore them entirely, or perhaps truncate them.

+1  A: 

Just catch the OutOfMemoryError, log which document you were on, then move on to the next one. The garbage collector will make sure you have enough memory for the next document.

(This is one of the strategies I use with the Stanford dependency parser to move on to the next sentence if one sentence is too long or convoluted to parse.)
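In code, it looks something like this (just a sketch; tagDocument() is a placeholder for the library call):

import java.util.ArrayList;
import java.util.List;

public class SkipOomDocuments {
    static List<String> tagAll(List<String> documents) {
        List<String> results = new ArrayList<String>();
        for (int i = 0; i < documents.size(); i++) {
            try {
                results.add(tagDocument(documents.get(i)));
            } catch (OutOfMemoryError e) {
                // Log which document failed and move on; the failed call's
                // data becomes unreachable, so the GC reclaims that memory.
                System.err.println("OOM while tagging document " + i + "; skipping it");
            }
        }
        return results;
    }

    static String tagDocument(String document) {
        return document; // placeholder for the real NER invocation
    }
}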

Ken Bloom