Consider the following schema in a Postgres database.

CREATE TABLE employee
(
  id_employee serial NOT NULL PRIMARY KEY,
  tx_email_address text NOT NULL UNIQUE,
  tx_passwd character varying(256)
);

I have a Java class which does the following:

conn.setAutoCommit(false);

ResultSet rs = stmt.executeQuery("select * from employee where tx_email_address = 'test1'");
if (!rs.next()) {
    stmt.executeUpdate("insert into employee (tx_email_address, tx_passwd) values ('test1', 'test1')");
}
rs = stmt.executeQuery("select * from employee where tx_email_address = 'test2'");
if (!rs.next()) {
    stmt.executeUpdate("insert into employee (tx_email_address, tx_passwd) values ('test2', 'test2')");
}
rs = stmt.executeQuery("select * from employee where tx_email_address = 'test3'");
if (!rs.next()) {
    stmt.executeUpdate("insert into employee (tx_email_address, tx_passwd) values ('test3', 'test3')");
}
rs = stmt.executeQuery("select * from employee where tx_email_address = 'test4'");
if (!rs.next()) {
    stmt.executeUpdate("insert into employee (tx_email_address, tx_passwd) values ('test4', 'test4')");
}

conn.commit();
conn.setAutoCommit(true);

The problem here is that if there are two or more concurrent instances of the above transaction trying to write data, only one transaction eventually succeeds and the rest throw an SQLException "unique key constraint violation". How do we get around this?

PS: I have chosen only one table and simple insert queries to demonstrate the problem. My application is a Java-based application whose sole purpose is to write data to the target database. There can be concurrent processes doing so, and there is a very high probability that some processes will try to write the same data (as shown in the example above).

A: 

An often used approach is to have a primary key that is a UUID (Universally Unique Identifier) and a UUID generator, see http://jug.safehaus.org/ or similar things; Google has lots of answers.
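For illustration, a minimal sketch assuming the id_employee column is changed from serial to the uuid type; it uses the JDK's built-in java.util.UUID rather than the JUG library linked above:

// Hypothetical sketch: generate the key client-side so concurrent writers never collide on it.
java.util.UUID id = java.util.UUID.randomUUID();
PreparedStatement ps = conn.prepareStatement(
        "insert into employee (id_employee, tx_email_address, tx_passwd) values (?, ?, ?)");
ps.setObject(1, id);            // assumes id_employee is now a uuid column
ps.setString(2, "test1");
ps.setString(3, "test1");
ps.executeUpdate();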

This will prevent the unique key constraint violation on the primary key from happening.

But that of course is only part of your problem; your tx_email_address would still have to be unique, and nothing solves that.

There is no way to prevent the constraint violation from happening: as long as you have concurrency you will run into it, and in itself this really is not a problem.

Peter
The problem here is not unique identification of an entry, it is redundant data. And again, this process involves writing data to almost 12 different tables which are later used for report generation and other things; redundant data can cause big problems there. Also, the application is already in production, and the only issue we are trying to address is the concurrent-writing scenario, which is critical because this application is more or less data driven. Making design changes at this point can open up a new can of worms. Is there any solution that can take care of this at the DB end?
Salman
A: 

You could expose a public method that queues the write operations and handles queue concurrency, then create another method to run on a different thread (or another process entirely) that actually performs the writes serially.
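A minimal sketch of that idea, assuming a single dedicated writer thread draining a queue (the class, method, and queue names are made up for illustration, and as the comments on the next answer point out, this only serializes writes within one JVM):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SerialWriter implements Runnable {

    private final BlockingQueue<String[]> queue = new LinkedBlockingQueue<String[]>();
    private final Connection conn;

    public SerialWriter(Connection conn) {
        this.conn = conn;
    }

    // Callers on any thread just enqueue the row they want written.
    public void submit(String email, String passwd) {
        queue.add(new String[] { email, passwd });
    }

    // Run this on a single background thread so writes happen one at a time.
    public void run() {
        try {
            while (true) {
                String[] row = queue.take();          // blocks until work arrives
                insertIfAbsent(row[0], row[1]);       // safe: only this thread writes
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void insertIfAbsent(String email, String passwd) {
        try {
            PreparedStatement check = conn.prepareStatement(
                    "select 1 from employee where tx_email_address = ?");
            check.setString(1, email);
            if (!check.executeQuery().next()) {
                PreparedStatement insert = conn.prepareStatement(
                        "insert into employee (tx_email_address, tx_passwd) values (?, ?)");
                insert.setString(1, email);
                insert.setString(2, passwd);
                insert.executeUpdate();
            }
        } catch (SQLException e) {
            e.printStackTrace();                      // real code would log and/or retry
        }
    }
}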

Jeff Sternal
A: 

You could add concurrency control at the application level by making the code a critical section:

synchronized(lock) {
  // Code to perform selects / inserts within database transaction.
}

This way one thread is prevented from querying the table while the other is querying and inserting into the table. When the first thread completes, the second thread enters the synchronized block. However, at this point each select attempt will return data and hence the thread will not attempt to insert data.

EDIT:

In cases where you have multiple processes inserting into the same table you could consider taking out a table lock when performing the transaction to prevent other transactions from commencing. This is effectively doing the same as the code above (i.e. serializing the two transactions) but at the database level. Obviously there are potential performance implications in doing this.
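For example, a rough sketch of taking out a table lock from JDBC (the lock mode shown is just one plausible choice; check the Postgres documentation for what your workload needs):

conn.setAutoCommit(false);
try {
    // SHARE ROW EXCLUSIVE conflicts with itself, so two of these transactions cannot
    // run the select/insert section at the same time; plain readers are not blocked.
    stmt.execute("LOCK TABLE employee IN SHARE ROW EXCLUSIVE MODE");

    // ... perform the selects and inserts from the question here ...

    conn.commit();
} catch (SQLException e) {
    conn.rollback();
    throw e;
}
conn.setAutoCommit(true);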

Adamski
That only works if all transactions are started in the same VM. If you have multiple clients, synchronizing has no use whatsoever.
Peter
This is a fair point. However, I would typically design a system so that one process is the custodian of the data and is responsible for inserting the data (although I sometimes allow multiple readers).
Adamski
The writer for me is a POJO web service, which again can be clustered, so this option is out. Is there anything that can be done at the database level?
Salman
@Salman: See my most recent edit. You'll have to check out the Postgres documentation for precisely how to obtain the lock though.
Adamski
A: 

The simplest way would seem to be to use the transaction isolation level 'serializable', which prevents phantom reads (other people inserting data which would satisfy a previous SELECT during your transaction).

if (!conn.getMetaData().supportsTransactionIsolationLevel(Connection.TRANSACTION_SERIALIZABLE)) {
    // OK, you're hosed. Hope for your sake that your driver supports this isolation level.
}
conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);

There are also techniques like Oracle's "MERGE" statement -- a single statement which does 'insert or update', depending on whether the data's there. I don't know if Postgres has an equivalent, but there are techniques to 'fake it' -- see e.g. How to write INSERT IF NOT EXISTS queries in standard SQL.
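One common way to 'fake it' in Postgres is to make the INSERT itself conditional, so the check and the write happen in a single statement; a rough sketch from JDBC (a concurrent duplicate can still violate the unique constraint, but the window is much smaller than with select-then-insert):

PreparedStatement ps = conn.prepareStatement(
        "insert into employee (tx_email_address, tx_passwd) " +
        "select ?, ? " +
        "where not exists (select 1 from employee where tx_email_address = ?)");
ps.setString(1, "test1");
ps.setString(2, "test1");
ps.setString(3, "test1");
int inserted = ps.executeUpdate();   // 0 if the row was already there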

Cowan
+1  A: 

I would first try to design the data flow in a way that only one transaction will ever get one instance of the data. In that scenario the "unique key constraint violation" should never happen and therefore indicate a real problem.

Failing that, I would catch and ignore the "unique key constraint violation" after each insert. Of course, logging that it happened might still be a good idea.
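A minimal sketch of that, assuming Postgres (whose SQLState for unique_violation is 23505) and a log object standing in for whatever logging the application uses:

try {
    stmt.executeUpdate(
            "insert into employee (tx_email_address, tx_passwd) values ('test1', 'test1')");
} catch (SQLException e) {
    if ("23505".equals(e.getSQLState())) {
        // unique_violation: another transaction got there first, safe to ignore
        log.info("Row already present, skipping: " + e.getMessage());
    } else {
        throw e;
    }
}

Note that inside a single Postgres transaction a failed insert aborts the whole transaction, so this pattern needs autocommit, a savepoint around each insert, or a retry of the transaction.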

If both approaches were not feasible for some reason, then I would most probably create a transit table with the same structure as "employee", but without the primary key constraint and with a "transit status" field. No "unique key constraint violation" would ever happen on inserts into this transit table. A job would be needed that reads this transit table and transfers the data into the "employee" table. This job would use the "transit status" to keep track of processed rows. I would let the job do the following things each run:

  • Execute an update statement on the transit table to set the "transit status" to "work in progress" for a number of rows. How large that number is, or whether all currently new rows get marked, would need some thinking over.
  • Execute an update statement that sets "transit status" to "duplicate" for all rows whose data is already in the "employee" table and whose "transit status" is not in ("duplicate", "processed").
  • Repeat as long as there are rows in the transit table with "transit status" = "work in progress":
    • Select a row from the transit table with "transit status" = "work in progress".
    • Insert that row's data into the "employee" table.
    • Set this row's "transit status" to "processed".
    • Update all rows in the transit table that have the same data as the currently processed row and "transit status" = "work in progress" to "transit status" = "duplicate".

I would most probably want another job to regularly delete the rows with "transit status" in ("duplicate", "processed").

If Postgres does not have database jobs, an OS-side job would do.
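A rough sketch of the transfer job, assuming a transit table called employee_transit with an id_transit key, the employee columns, and a transit_status column that starts out as 'new' (all of these names are made up for illustration):

// Mark rows whose data is already in employee as duplicates.
stmt.executeUpdate(
        "update employee_transit t set transit_status = 'duplicate' " +
        "where transit_status not in ('duplicate', 'processed') " +
        "and exists (select 1 from employee e " +
        "            where e.tx_email_address = t.tx_email_address)");

// Claim the currently new rows for this run.
stmt.executeUpdate(
        "update employee_transit set transit_status = 'work in progress' " +
        "where transit_status = 'new'");

// Transfer one row at a time, flagging remaining copies as duplicates as we go.
PreparedStatement pick = conn.prepareStatement(
        "select id_transit, tx_email_address, tx_passwd from employee_transit " +
        "where transit_status = 'work in progress' limit 1");
PreparedStatement insert = conn.prepareStatement(
        "insert into employee (tx_email_address, tx_passwd) values (?, ?)");
PreparedStatement done = conn.prepareStatement(
        "update employee_transit set transit_status = 'processed' where id_transit = ?");
PreparedStatement dupes = conn.prepareStatement(
        "update employee_transit set transit_status = 'duplicate' " +
        "where tx_email_address = ? and transit_status = 'work in progress'");

ResultSet rs = pick.executeQuery();
while (rs.next()) {
    String email = rs.getString("tx_email_address");

    insert.setString(1, email);
    insert.setString(2, rs.getString("tx_passwd"));
    insert.executeUpdate();

    done.setLong(1, rs.getLong("id_transit"));
    done.executeUpdate();

    dupes.setString(1, email);
    dupes.executeUpdate();

    rs = pick.executeQuery();   // duplicates of this row are now flagged, pick the next one
}
conn.commit();                  // assuming autocommit is off, as in the question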

Juergen Hartelt
A: 

One way to solve this particular problem is by ensuring that each of the individual threads/instances process rows in a mutually exclusive manner. In other words if instance 1 processes rows where tx_email_address = 'test1' then no other instance should process these rows again.

This can be achieved by generating a unique server id on instance startup and marking the rows to be processed with this server id. The way to do it is:

  1. Add 2 columns, status and server_id, to the employee table.
  2. Update employee set status='In Progress', server_id='' where status='Uninitialized' and rownum<2
  3. commit
  4. select * from employee where server_id='' and status='In Progress'
  5. Process the rows selected in step 4.

Following the above sequence of steps ensures that all the VM instances get different rows to process and there is no deadlock. It is necessary to do the update before the select to make the operation atomic. Doing it the other way round can lead to concurrency issues.
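A rough JDBC sketch of that claim-then-process sequence, assuming the two new columns exist and a random id per instance (rownum<2 above is Oracle syntax; a subquery with LIMIT plays the same role in Postgres):

String serverId = java.util.UUID.randomUUID().toString();   // unique per instance/startup

conn.setAutoCommit(false);

// Step 2: claim one uninitialized row for this instance.
// Note: two instances racing for the same row can still overlap; stricter locking
// (e.g. select ... for update) may be needed in practice.
PreparedStatement claim = conn.prepareStatement(
        "update employee set status = 'In Progress', server_id = ? " +
        "where id_employee in (select id_employee from employee " +
        "                      where status = 'Uninitialized' limit 1)");
claim.setString(1, serverId);
claim.executeUpdate();

// Step 3: commit so the claim is visible to other instances.
conn.commit();

// Step 4: read back only the rows this instance claimed.
PreparedStatement mine = conn.prepareStatement(
        "select * from employee where server_id = ? and status = 'In Progress'");
mine.setString(1, serverId);
ResultSet rows = mine.executeQuery();
while (rows.next()) {
    // Step 5: process the claimed row here.
}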

Hope this helps

Sharad