I'm building a database web application using Java and Hibernate's JPA implementation. The application tracks objects. It also has to batch import objects from a legacy source.

For example, let's say we're tracking people. The database has tables called Person and Address. There are corresponding JPA entity and DAO classes.

On top of the JPA layer is a service layer responsible for various operations. One operation is to import a potentially large set of data from an external legacy source (for example, people from a phone book). For each person, it has to check whether that person already exists in the database, then create or update the person as necessary. Each person has an address, so the appropriate cross-reference and address creation also have to occur.

My issue is that this operation can be slow for large data sets. My current algorithm is:

for (Person person : allPersons)
{
    // check if person exists in database
    // check if address exists in database
    // create or update person and address as necessary
}

What would you recommend to improve performance?

Off the top of my head I can think of:

  1. Change the import logic to retrieve and store data using bulk queries. For example, instead of checking whether each person exists inside the for loop, submit all the person keys to the database in one query, then process each retrieved person in memory.
  2. Add my own caching in the DAO classes.
  3. Use an external caching solution (such as memcached).

I can always go with #1 by restructuring to minimize queries (roughly as sketched below). The downside is that my service layer becomes very aware of the DAO layer; its implementation is now dictated by the lower database layer. There are also other issues, like using too much memory. This grab-from-database-then-process-in-memory approach feels very homegrown and goes against off-the-shelf solutions like JPA. I'm curious what others would do in this case.
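For option #1, here is a minimal sketch of what the bulk lookup could look like, assuming a plain JPA EntityManager and a hypothetical natural key field externalId on the Person entity (those names are illustrative, not my actual schema):

    import javax.persistence.EntityManager;
    import javax.persistence.TypedQuery;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Fetch every already-persisted Person whose key appears in the import
    // batch with a single query, then index them by key for in-memory lookup.
    public Map<String, Person> findExistingPersons(EntityManager em, Set<String> externalIds)
    {
        TypedQuery<Person> query = em.createQuery(
            "select p from Person p where p.externalId in (:ids)", Person.class);
        query.setParameter("ids", externalIds);

        Map<String, Person> byId = new HashMap<String, Person>();
        for (Person p : query.getResultList())
        {
            byId.put(p.getExternalId(), p);
        }
        return byId;
    }

For very large imports the key set itself would probably have to be chunked, since most databases limit how many values an IN clause can hold.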

Edit: Caching will not help, as each person queried within the loop is different.

+1  A: 

There are two solutions that I have found that work. One is to process a chunk at a time, closing and restarting the session after each chunk. I have tried using the flush and clear methods on the session, but sometimes they just don't function like you'd expect. Starting and stopping the transaction between batches seems to work best.
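A rough sketch of that chunking pattern with a plain EntityManager (the batch size is a placeholder, and merge() stands in for whatever create-or-update logic your DAOs do):

    import javax.persistence.EntityManager;
    import java.util.List;

    private static final int BATCH_SIZE = 50; // illustrative; tune for your data

    public void importPersons(EntityManager em, List<Person> allPersons)
    {
        em.getTransaction().begin();
        int count = 0;
        for (Person person : allPersons)
        {
            em.merge(person); // insert or update
            if (++count % BATCH_SIZE == 0)
            {
                em.flush();  // push the pending SQL to the database
                em.clear();  // detach managed entities so the context stays small
                // If flush/clear alone misbehaves, commit and start a fresh
                // transaction here instead, as described above:
                // em.getTransaction().commit();
                // em.getTransaction().begin();
            }
        }
        em.getTransaction().commit();
    }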

If performance is a major concern, you might just break down and do it in JDBC. Hibernate adds too much overhead for batch processing of large datasets where memory and performance are important.
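If you do drop to JDBC, a minimal batched-insert sketch might look like the following (table and column names are made up for illustration, and updates would need a second statement):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public void insertPersons(Connection conn, List<Person> persons) throws SQLException
    {
        String sql = "insert into Person (external_id, name) values (?, ?)";
        conn.setAutoCommit(false);
        PreparedStatement ps = conn.prepareStatement(sql);
        try
        {
            int count = 0;
            for (Person p : persons)
            {
                ps.setString(1, p.getExternalId());
                ps.setString(2, p.getName());
                ps.addBatch();
                if (++count % 100 == 0)
                {
                    ps.executeBatch(); // one round trip per 100 rows
                }
            }
            ps.executeBatch(); // flush the final partial batch
            conn.commit();
        }
        finally
        {
            ps.close();
        }
    }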

Ruggs
One of the reasons that flush doesn't always free up memory is that Hibernate keeps references to all of the objects saved in the transaction, so that it can call post-commit hooks on them, if any are defined, at the end of the transaction.
davidsheldon
A: 

Your approach is going to result in far too many individual queries against the database; it looks like 4n + 1. If possible, I would write a query (perhaps in raw SQL) that checks for the existence of the person and address all in one shot.
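For example, something along these lines (here with JPQL rather than raw SQL; the entity names, the externalId key, and the person-to-address mapping are assumptions made for the sake of the sketch):

    import javax.persistence.EntityManager;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // One round trip that reports which import keys already have both a
    // Person and a linked Address in the database.
    public Set<String> findKeysWithPersonAndAddress(EntityManager em, Set<String> keys)
    {
        List<String> existing = em.createQuery(
                "select p.externalId from Person p join p.address a "
                + "where p.externalId in (:keys)", String.class)
            .setParameter("keys", keys)
            .getResultList();
        return new HashSet<String>(existing);
    }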

You may want to work with the StatelessSession instead of the standard Hibernate Session. Since it doesn't have a first-level cache, it should keep your memory requirements lower.

http://www.hibernate.org/hib_docs/reference/en/html/batch-statelesssession.html
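A minimal sketch of the StatelessSession approach, assuming you can get at the underlying SessionFactory behind the JPA layer (how you obtain it depends on your setup):

    import org.hibernate.SessionFactory;
    import org.hibernate.StatelessSession;
    import org.hibernate.Transaction;
    import java.util.List;

    // A StatelessSession has no first-level cache and no dirty checking, so
    // each statement is executed immediately and nothing piles up in memory.
    public void importPersons(SessionFactory sessionFactory, List<Person> allPersons)
    {
        StatelessSession session = sessionFactory.openStatelessSession();
        Transaction tx = session.beginTransaction();
        try
        {
            for (Person person : allPersons)
            {
                session.insert(person); // or session.update(person) for existing rows
            }
            tx.commit();
        }
        finally
        {
            session.close();
        }
    }

Keep in mind that a StatelessSession bypasses cascades, so associated Address rows have to be inserted explicitly.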

If that doesn't work for you then you'll want to take a look at the batch options in Hibernate:

http://www.hibernate.org/hib_docs/reference/en/html/batch.html
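The main setting there is hibernate.jdbc.batch_size, which makes Hibernate group its inserts and updates into JDBC batches. A sketch of supplying it when bootstrapping JPA (the persistence-unit name and batch size are placeholders):

    import javax.persistence.EntityManagerFactory;
    import javax.persistence.Persistence;
    import java.util.HashMap;
    import java.util.Map;

    public EntityManagerFactory createBatchingFactory()
    {
        Map<String, Object> props = new HashMap<String, Object>();
        // Group SQL statements into JDBC batches of 50 (value is illustrative).
        props.put("hibernate.jdbc.batch_size", "50");
        // Ordering helps statements for the same table land in the same batch.
        props.put("hibernate.order_inserts", "true");
        props.put("hibernate.order_updates", "true");
        return Persistence.createEntityManagerFactory("my-persistence-unit", props);
    }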

cliff.meyers