I'm building a database web application using Java and Hibernate's JPA implementation. The application tracks objects and also has to batch-import them from a legacy source.
For example, let's say we're tracking people. The database has tables called Person and Address. There are corresponding JPA entity and DAO classes.
On top of the JPA layer is a service layer responsible for various operations. One operation is to import a potentially large set of data from an external legacy source (for example, people from a phone book). For each person, the operation has to check whether that person already exists in the database, then create or update the person as necessary. Each person has an address, so the appropriate cross-reference and address creation also have to occur.
My issue is that this operation can be slow for large data sets. My current algorithm is:
for (Person person : allPersons)
{
    // check if person exists in database
    // check if address exists in database
    // create or update person and address as necessary
}
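In code, the current approach looks roughly like this (the DAO method names are simplified stand-ins for my actual API, and copyFrom is a hypothetical field-copy helper):

    for (Person person : allPersons)
    {
        // one SELECT per person, and another per address
        Person existingPerson = personDao.findByExternalId(person.getExternalId());
        Address existingAddress = addressDao.findByPersonExternalId(person.getExternalId());

        if (existingPerson == null)
        {
            personDao.save(person);
        }
        else
        {
            existingPerson.copyFrom(person); // hypothetical helper that updates fields
            personDao.update(existingPerson);
        }

        if (existingAddress == null)
        {
            addressDao.save(person.getAddress());
        }
    }

So for N people the loop issues on the order of 2N SELECTs plus the writes, which is where the time goes.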
What would you recommend to improve performance?
Off the top of my head I can think of:
- Change the import logic to retrieve and store data using bulk queries. For example, instead of checking whether each person exists inside the for loop, submit all the person keys to the database in one query, then process each retrieved person in memory (see the sketch after this list).
- Add my own caching in the DAO classes.
- Use an external caching solution (such as memcached).
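For option 1, I imagine something like the following minimal sketch. It uses the EntityManager directly; the externalId natural key and the copyFrom helper are assumptions about my schema, not an existing API:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import javax.persistence.EntityManager;

    public class PersonBatchImporter
    {
        private final EntityManager em;

        public PersonBatchImporter(EntityManager em)
        {
            this.em = em;
        }

        public void importBatch(List<Person> allPersons)
        {
            // Collect the natural keys for the whole batch
            List<String> keys = new ArrayList<String>();
            for (Person p : allPersons)
            {
                keys.add(p.getExternalId());
            }

            // One IN query instead of one SELECT per person
            // (a very large batch would need the key list chunked)
            List<Person> existing = em.createQuery(
                    "select p from Person p where p.externalId in :keys", Person.class)
                .setParameter("keys", keys)
                .getResultList();

            Map<String, Person> byKey = new HashMap<String, Person>();
            for (Person p : existing)
            {
                byKey.put(p.getExternalId(), p);
            }

            // Resolve create-vs-update entirely in memory
            for (Person incoming : allPersons)
            {
                Person current = byKey.get(incoming.getExternalId());
                if (current == null)
                {
                    em.persist(incoming); // new person; address cascades if mapped that way
                }
                else
                {
                    current.copyFrom(incoming); // hypothetical helper; managed entity is flushed on commit
                }
            }
        }
    }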
I can always go with #1 by restructuring to minimize queries. The downside is that my service layer becomes very aware of the DAO layer: its implementation is dictated by the lower database layer. There are also other issues, like using too much memory. This grab-from-database-then-process-in-memory approach seems very homegrown and goes against off-the-shelf solutions like JPA. I'm curious what others would do in this case.
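The memory issue could at least be bounded by processing the import in chunks and clearing the persistence context between chunks, something like this fragment inside the same importer (the chunk size is arbitrary, and real JDBC batching would also need hibernate.jdbc.batch_size set):

    // Bound memory by working through the batch in fixed-size chunks
    int chunkSize = 50; // arbitrary; align with hibernate.jdbc.batch_size if set
    for (int i = 0; i < allPersons.size(); i += chunkSize)
    {
        List<Person> chunk = allPersons.subList(i, Math.min(i + chunkSize, allPersons.size()));
        importBatch(chunk); // the IN-query sketch above, limited to this chunk
        em.flush();         // push pending inserts/updates to the database
        em.clear();         // detach managed entities so the persistence context stays small
    }

But this still has the service layer micromanaging persistence details, which is exactly what bothers me.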
Edit: Caching will not help, as each person queried within the loop is different.