views: 348
answers: 3

I'm developing a server that receives nightly reports from hundreds of business units. The reports are currently encrypted CSV files. In total they should amount to 500 000 to 1 000 000 records each day, which are saved to a database for later use.

I've created a set of PreparedStatements for each transmission. These statements are used to batch 50 records before executing and committing. Each record may cause up to 20 database inserts. All is well when transmissions are queued and handled one by one.

As I tried to process transmissions concurrently, I noticed that different threads got the exact same instances of the PreparedStatements. This caused the following problems:

  1. Multiple threads added statements to the same batch.
  2. Batches were executed whenever any of the threads decided it was time to do so.
  3. Commit was called before the database's constraints were satisfied, because some of the threads had not yet had time to use some of the statements.

The question is: is there a way to force a new prepared statement to be created instead of reusing an existing one from the statement cache?

If not, is there any better way to handle the situation than by

  • creating a separate data source for the batches that does not have statement/connection pooling
  • dropping constraints from the database; insert order would not matter anymore
  • forcing sequential processing
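For illustration, the kind of thread confinement I'd like is sketched below. This is a simplified simulation, not real JDBC code: a plain Object stands in for a PreparedStatement, and the class and method names are invented for the example. The point is that a ThreadLocal would hand each thread its own instance, so no cache could give two threads the same object.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ThreadConfinedStatements {
    // Stand-in for conn.prepareStatement(sql): each thread lazily gets its own
    // instance. With real JDBC, the ThreadLocal would hold a Connection (and
    // the PreparedStatements prepared on it) that is never shared across threads.
    private static final ThreadLocal<Object> STATEMENT =
            ThreadLocal.withInitial(Object::new);

    // Returns true when two distinct threads observe distinct instances.
    public static boolean distinctPerThread() throws InterruptedException {
        AtomicReference<Object> seenByT1 = new AtomicReference<>();
        AtomicReference<Object> seenByT2 = new AtomicReference<>();
        Thread t1 = new Thread(() -> seenByT1.set(STATEMENT.get()));
        Thread t2 = new Thread(() -> seenByT2.set(STATEMENT.get()));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        return seenByT1.get() != seenByT2.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(distinctPerThread()); // prints "true"
    }
}
```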


Edit: Attempt to clarify the problem

Let there be threads T1 and T2. Let there be prepared statements S1 and S2. Let there be batches B1 and B2.

Each time S1 is used, it is added to B1. Each time S2 is used, it is added to B2. When committing, S1 must be committed before S2 per the foreign key constraint.

The problem occurs when:

  • T1 processes transmissions gleefully
  • T2 processes transmissions innocently
  • T1 uses statement S1 adding s1a to batch B1 containing s1a
  • T1 uses statement S2 adding s2a to batch B2 containing s2a
  • T1 decides it is time to commit
  • T1 commits batch B1 containing s1a
  • T2 uses S1 adding s1b to batch B1 containing s1b
  • T2 uses S2 adding s2b to batch B2 containing s2a, s2b
  • T1 commits batch B2 containing s2a, s2b
  • Database says 'no no' as s2b is committed before s1b, which the foreign key constraint forbids.

This can be avoided with manual synchronization, as pointed out in the answers, but then I still have to track the size of each batch separately instead of applying logic local to each thread.
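The manual synchronization in question boils down to holding one lock across the whole add-to-batch / execute / commit sequence, so no other thread can interleave its rows between the flush of B1 and the flush of B2. Below is a minimal simulation of that idea; lists stand in for the JDBC batches and the "database", and the class name is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class SynchronizedBatches {
    private final List<String> b1 = new ArrayList<>(); // batch for S1 (parent rows)
    private final List<String> b2 = new ArrayList<>(); // batch for S2 (child rows)
    private final List<String> committed = new ArrayList<>(); // what the "database" saw
    private final Object batchLock = new Object();

    // Adding to the batches and flushing them happen under one lock, so no
    // other thread can slip its rows in between the flush of b1 and b2.
    public void addRecord(String s1Row, String s2Row, int flushThreshold) {
        synchronized (batchLock) {
            b1.add(s1Row);
            b2.add(s2Row);
            if (b1.size() >= flushThreshold) {
                committed.addAll(b1); // execute + commit S1's batch first
                committed.addAll(b2); // then S2's, satisfying the FK order
                b1.clear();
                b2.clear();
            }
        }
    }

    public List<String> committed() {
        synchronized (batchLock) {
            return new ArrayList<>(committed);
        }
    }

    public static void main(String[] args) {
        SynchronizedBatches db = new SynchronizedBatches();
        db.addRecord("s1a", "s2a", 1);
        db.addRecord("s1b", "s2b", 1);
        System.out.println(db.committed()); // prints "[s1a, s2a, s1b, s2b]"
    }
}
```

Note the cost: the lock covers both the adds and the flush, which is exactly the batch-size bookkeeping across threads I'd like to avoid.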

+1  A: 

Are you trying to use multiple statements there from a single connection instance? IMO, a connection pool is recommended for the behaviour you describe. The alternative is to synchronize manually.

Everyone
Thank you for your answer. Each thread has one connection. Each connection has multiple statements, one PreparedStatement instance for each separate SQL statement. Each statement contains a batch. The problem is that, because of statement caching, each thread does not have a unique set of statements, which causes problems with the batches. Connection pooling and statement caching do not really help here, as connection and statement preparation events are few and far between.
Aleksi
Still trying to understand the issue - is the order of execution for the statements the problem?
Everyone
I edited the question for (hopefully) some clarification of the scenario. Execution order causes the crash, yes. On the other hand, shared statements and batches make it impossible to trust local state. This, in my opinion, causes avoidable, non-beneficial concurrency in my scenario, i.e. managing the batches separately from the actual working thread.
Aleksi
Ah. No.. to my understanding a mechanism to identify which Thread may execute which statement(s) needs to be provided manually. )+: Sorry I couldn't be more help.
Everyone
+1  A: 

The solution is vendor-specific.

If your code runs under a servlet, then you might be able to solve your problem by configuring the datasource in your webapp. I have done that with Oracle driver under Tomcat, but I'm sure other application servers have similar ways to configure connection pooling.

If your code is standalone, then you'll have to use a vendor-specific API. Since you target Oracle as your production database, here's a quick example for the Oracle JDBC driver:

import java.sql.Connection;
import java.sql.SQLException;

import oracle.jdbc.OracleConnection;

...

// Disable the driver's implicit statement cache so that each
// prepareStatement call on this connection returns a fresh instance.
public static void disableStatementCaching(Connection conn)
        throws SQLException {
    ((OracleConnection) conn).setImplicitCachingEnabled(false);
}

...

For more info, see the JDBC Developer's Guide for Oracle 10.2.

Juris
Thanks for your answer. I will definitely have a look at vendor-specific APIs, even though I'm a bit skeptical about using them. The application doesn't run as a servlet; it's published as a web service via the EJB3 @WebService annotation. However, using an application-server-defined datasource and configuring it is entirely possible. It just needs proper documentation, which will probably lead to a situation where a future developer decides to optimize performance by re-enabling statement caching.
Aleksi
A: 

My current solution is to stop worrying and start loving the shared batches. I have split the processing algorithm into two phases:

  1. Parse a set of N records and save them in an intermediate format.
  2. Persist the set of N records as a batch once a lock is granted to the current thread.

This allows the parsing to be concurrent and the batching sequential. I'll just have to find a sweet spot that minimizes the waiting time between threads.

The quest for a sweet spot may lead to implementing some sort of two-phase locking scheme, i.e. let each thread do as it pleases and, on commit, make sure all threads have completed their current record before the actual batch execution.

In the latter solution it might be necessary to synchronize parameter setting for each PreparedStatement, although I haven't tested whether that causes any trouble. It should.
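The two-phase split above can be sketched as follows. This is a simplified simulation, not the real JDBC code: parsing produces strings, persistence appends to a list, and the class and method names are invented for the example. Phase 1 runs fully concurrently; phase 2 runs under a lock so batches from different threads never interleave.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TwoPhaseLoader {
    private final List<String> database = new ArrayList<>(); // stands in for the real DB
    private final ReentrantLock batchLock = new ReentrantLock();

    // Phase 1: parse a transmission into an intermediate format.
    // No shared state is touched, so this can run fully concurrently.
    private List<String> parse(String transmission, int records) {
        List<String> parsed = new ArrayList<>();
        for (int i = 0; i < records; i++) {
            parsed.add(transmission + "-rec" + i);
        }
        return parsed;
    }

    // Phase 2: persist the parsed set as one batch while holding the lock,
    // so batches from different threads never interleave.
    private void persist(List<String> parsed) {
        batchLock.lock();
        try {
            database.addAll(parsed);
        } finally {
            batchLock.unlock();
        }
    }

    public int load(List<String> transmissions, int recordsEach) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String t : transmissions) {
            pool.submit(() -> persist(parse(t, recordsEach)));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        batchLock.lock();
        try {
            return database.size();
        } finally {
            batchLock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int persisted = new TwoPhaseLoader().load(List.of("a", "b", "c"), 5);
        System.out.println(persisted); // prints "15"
    }
}
```

The sweet spot mentioned above is the N in "N records per batch": larger N means less lock contention but more memory held in the intermediate format.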

Aleksi