Hi.

Well, we have a web app running on JBoss, and we're getting an "OutOfMemory" error when trying to insert a lot of rows into several tables of a Postgres DB. This is the complete environment for this error:
* JBoss 4.3.x GA
* Java 1.6.0
* Hibernate 3.0
* PostgreSQL 8.3 (driver)
About the actual processing setup:
* The heavy part is that we're parsing huge amounts of XML documents, each one downloaded separately from a specific URL (1 URL = 1 XML). We accomplish that with an EJB that distributes the generated URLs to a queue; a pool of MDBs then connects using streams and generates the documents (note that we've actually had to raise the stack size because of the XML documents' size, and we're stuck with having to read each document in one stream). Once a document is generated, it goes to another queue where another MDB pool listens.
Those MDBs parse the doc, storing the information in several entities (at least 5) that are then persisted in the DB (note that transaction management is set to "BEAN", and the transaction is begun and committed within each MDB's work). Processing the URLs sequentially is not an option because of the number of URLs involved; it would take like 2 months or so... lol
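For illustration, the second-stage MDB described above might look roughly like this. This is only a sketch of the setup as described (EJB3 annotations, bean-managed transactions); DocPersisterMDB, HibernateUtil, ParsedDoc and parse() are invented names standing in for the real code:

```java
// Sketch only: an EJB3 message-driven bean with bean-managed transactions,
// one transaction per incoming document.
import javax.annotation.Resource;
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.ejb.TransactionManagement;
import javax.ejb.TransactionManagementType;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;
import javax.transaction.UserTransaction;

import org.hibernate.Session;

@MessageDriven(activationConfig = {
        @ActivationConfigProperty(propertyName = "destination",
                                  propertyValue = "queue/DocumentQueue") // made-up queue name
})
@TransactionManagement(TransactionManagementType.BEAN)
public class DocPersisterMDB implements MessageListener {

    @Resource
    private UserTransaction utx;

    public void onMessage(Message message) {
        Session session = HibernateUtil.getSessionFactory().openSession(); // hypothetical helper
        try {
            String xml = ((TextMessage) message).getText();

            utx.begin();                      // BMT: begin per message
            ParsedDoc doc = parse(xml);       // builds the 5+ entities for this document
            session.save(doc);                // assumes cascading saves to the child entities
            session.flush();
            utx.commit();                     // commit per message
        } catch (Exception e) {
            // utx.rollback() and redelivery handling omitted for brevity
        } finally {
            session.close();                  // drop the first-level cache for this document
        }
    }

    private ParsedDoc parse(String xml) {
        // SAX/DOM parsing omitted; returns the root entity
        return null;
    }
}
```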

Trouble is... we parse and store like 200 URLs or so and then start getting an out-of-memory error related to PostgreSQL. Any ideas??

Thanks in advance!!

ALSO: It may be useful to know that this error wasn't happening before (I did parse a few thousand of those XMLs earlier); just generating the documents and parsing part of them into a few entities didn't seem to cause trouble. The trouble started when we began parsing more and more of each doc into its corresponding entities (like one entity having a list of "features" [another entity parsed from the same XML]).

+2  A: 

That sounds like you're keeping the beans for all 200 documents in memory.

Instead of keeping all the data in RAM the whole time, try to get rid of it as quickly as possible.

So when you are done reading the data for a bean, persist it and then forget about it (set the object reference to null), except maybe for the key you need to create references.

If you can't do this, use a profiler to make sure that you don't keep a reference to a document after you have parsed it. Usually, you don't need to keep the whole document anywhere after persisting it in the DB. If that is a problem, create a reference table that lets you link between documents and keep only the keys from that table in memory (instead of the whole documents).
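In Hibernate terms, that advice translates to flushing and clearing the session as you go and keeping only the generated keys around. A minimal sketch, assuming a hypothetical FeatureEntity with a Long id and a source-document URL (the batch size of 50 is arbitrary):

```java
// Sketch of "persist, then forget": flush and clear the Hibernate session
// periodically, and keep only the generated keys in memory.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class BatchPersister {

    private final SessionFactory sessionFactory;

    public BatchPersister(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    /** Persists the parsed entities and returns only their DB keys. */
    public Map<String, Long> persistAll(List<FeatureEntity> parsedFeatures) {
        Map<String, Long> keysByUrl = new HashMap<String, Long>();
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            int count = 0;
            for (FeatureEntity feature : parsedFeatures) {
                Long id = (Long) session.save(feature);      // assumes Long identifiers
                keysByUrl.put(feature.getDocumentUrl(), id);  // remember the key, not the entity
                if (++count % 50 == 0) {                      // arbitrary batch size
                    session.flush();   // push the pending INSERTs to Postgres
                    session.clear();   // evict everything from the first-level cache
                }
            }
            tx.commit();
        } finally {
            session.close();
        }
        return keysByUrl;
    }
}
```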

Aaron Digulla
+1  A: 

The short answer is that more information is required. I suggest using a tool like JProfiler to help understand where memory is being consumed.

Based on the design you described: why use two pools? If you just had one pool of URLs, worker threads could pull from it, download and parse each URL, create the entity beans, and save them to the DB (see the sketch below).

It isn't clear why you need the intermediate XML step, which sounds like a probable source of high memory consumption.
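To illustrate the single-pool idea, a worker pool where each task downloads, parses and persists one URL might look like this (a rough sketch; the download/parse/persist stubs stand in for the existing code, and the pool size is just a tuning knob):

```java
// Sketch of a single worker pool: each task downloads, parses and persists
// one URL, so nothing outlives the processing of a single document.
import java.net.URL;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UrlProcessor {

    public void processAll(List<URL> urls) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8); // pool size is a tuning knob
        for (final URL url : urls) {
            pool.submit(new Runnable() {
                public void run() {
                    String xml = download(url);      // stream the document
                    Object entities = parse(xml);    // build the entity beans
                    persist(entities);               // write them out, then let them go
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    // Stubs standing in for the existing download/parse/persist code.
    private String download(URL url)      { return null; }
    private Object parse(String xml)      { return null; }
    private void persist(Object entities) { }
}
```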

A: 

Aaron: I dispose of any entity bean or document instance as soon as I don't need it anymore; that's actually when I persist it.

Ben: I'm using two pools just to save some time: getting the doc from the URL stream takes about a second and parsing it takes about another, so I try to "simulate" parallel processing instead of doing the downloading and the parsing/persisting sequentially.
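For what it's worth, the same pipelining effect can be sketched with a bounded hand-off between the two stages; a bound also caps how many downloaded documents sit in memory at once (everything here is illustrative, not the actual MDB code):

```java
// Sketch of the download/parse pipeline with a bounded hand-off.
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DownloadParsePipeline {

    // At most 10 downloaded documents are buffered between the two stages.
    private final BlockingQueue<String> downloadedXml = new LinkedBlockingQueue<String>(10);

    public void start(final List<String> urls) {
        Thread downloader = new Thread(new Runnable() {
            public void run() {
                for (String url : urls) {
                    try {
                        downloadedXml.put(download(url)); // blocks when the buffer is full
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        });

        Thread parser = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        String xml = downloadedXml.take(); // blocks until a document arrives
                        persist(parse(xml));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        downloader.start();
        parser.start();
    }

    // Stubs standing in for the real download/parse/persist code.
    private String download(String url) { return ""; }
    private Object parse(String xml)    { return null; }
    private void persist(Object doc)    { }
}
```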

Thanks to both!

Juan Manuel
A: 

Here is a likely root cause of such a problem (and a low-level solution, argh!): http://benjchristensen.com/2008/05/27/mysql-jdbc-memory-usage-on-large-resultset/
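The linked post is about the MySQL driver, but the PostgreSQL JDBC driver behaves similarly: it buffers an entire ResultSet in memory unless autocommit is off and a fetch size is set. A small sketch of the cursor-based mode (connection details and the query are placeholders):

```java
// Sketch: with the PostgreSQL JDBC driver, large ResultSets are only streamed
// (cursor-based) when autocommit is off and a fetch size is set; otherwise the
// whole result is buffered in memory.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamingReadSketch {

    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver"); // needed for older, pre-JDBC4 driver jars

        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password"); // placeholder URL/credentials
        conn.setAutoCommit(false);              // required for cursor-based fetching

        PreparedStatement ps = conn.prepareStatement("SELECT id, payload FROM documents"); // placeholder query
        ps.setFetchSize(50);                    // stream rows in batches instead of all at once

        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            // process one row at a time
        }

        rs.close();
        ps.close();
        conn.close();
    }
}
```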

jmount