I have a database scenario (I'm using Oracle) in which several processes insert into a table and a single process selects from it. The table is basically used as intermediate storage, to which multiple processes (in the following called the Writers) write log events, and from which a single process (in the following referred to as the Reader) reads the events for further processing. The Reader must read all events inserted into the table.

Currently, this is done by assigning each inserted record an id from an ascending sequence. The Reader periodically selects a block of entries from the table where the id is larger than the largest id of the previously read block. E.g. something like:

SELECT *
FROM   TRANSACTION_LOG
WHERE  id > (SELECT last_id
             FROM   READER_STATUS);

The problem with this approach is that, since the Writers operate concurrently, rows are not always inserted in order according to their assigned id, even though the ids are assigned in sequentially ascending order. That is, a row with id=100 is sometimes written after a row with id=110, because the process writing the row with id=110 started after the process writing the row with id=100 but committed first. This can result in the Reader missing the row with id=100 if it has already read the row with id=110.

Forcing the Writers to take an exclusive lock on the table would solve the problem, as this would force them to insert sequentially and also force the Reader to wait for any outstanding commits. This, however, would probably not be very fast.

My thinking is that it would suffice for the Reader to wait for any outstanding Writer commits before reading. That is, the Writers may continue to operate concurrently, as long as the Reader does not read until all outstanding Writers have committed.

My question is this: how can I instruct my Reader process to wait for any outstanding commits of my Writer processes? Any alternative suggestion for the above problem is also welcome.

+1  A: 

Interesting problem. It sounds like you're building a nice solution.
I hope I can help.

A couple of suggestions...

Writer Status

You could create a table, WRITER_STATUS, with a last_id field. Before writing, each writer updates this table with the id it is about to write to the log, but only if that id is greater than the current value of last_id.

The reader also checks this table, and so knows whether any writers have claimed ids but not yet committed their rows.
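
A minimal sketch of what that might look like, assuming a single-row WRITER_STATUS table, :new_id as a bind variable for the id the writer has just drawn from the sequence, and that the status update is committed in its own transaction before the insert (otherwise the reader cannot see it):

-- Writer, before inserting: record the id it is about to write,
-- but only if it is higher than the id already recorded.
UPDATE writer_status
SET    last_id = :new_id
WHERE  last_id < :new_id;
COMMIT;

-- Reader: if MAX(id) in the log is below last_id, some writer has
-- claimed an id but not yet committed its row.
SELECT ws.last_id                            AS highest_claimed_id,
       (SELECT MAX(id) FROM transaction_log) AS highest_committed_id
FROM   writer_status ws;

Because every writer updates the same row, concurrent writers queue briefly on that row lock, which is why the status update is committed straight away in this sketch.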

Reader Log

This may be more efficient.
After the reader does a read, it checks for any holes in the records it's retrieved.
It then logs any missing ids to a MISSING_IDS table, and for its next read it does something like:

SELECT *
FROM   TRANSACTION_LOG
WHERE  id > (SELECT last_id
             FROM   READER_STATUS)
OR     id IN (SELECT id FROM MISSING_IDS);
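
Presumably the reader would also clear an entry once a late row finally shows up and has been processed; a one-line sketch, with :recovered_id as an assumed bind variable:

DELETE FROM missing_ids
WHERE  id = :recovered_id;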
AJ
Thanks. Certainly worth considering. I'll be trying various things out in the next couple of days.
aggergren
Cool. Please update your question to let us know how you get on (or even better, post an answer). It's one of those problems that crops up occasionally, and a good solution is gold dust! Good luck.
AJ
+1  A: 

You might want to put an exclusive lock on the table in the reader process. This will wait until all writers finish and release their row locks, so you can be sure there are no outstanding writer transactions.
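
In Oracle the reader side might be sketched like this (table names taken from the question); LOCK TABLE blocks until every transaction holding row locks on the table has committed or rolled back:

LOCK TABLE transaction_log IN EXCLUSIVE MODE;

SELECT *
FROM   transaction_log
WHERE  id > (SELECT last_id FROM reader_status);

COMMIT;  -- releases the table lock so writers can resume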

csgero
Yes, I agree that this will cause the Reader to wait. The question is what performance hit it will cause.
aggergren
Depends on how often and how long the reader process runs. This solution will at least allow multiple writers to run in parallel, so if writing happens significantly more often than reading, the performance hit should be bearable.
csgero
+1  A: 

I wouldn't do any locking, that can interfere with concurrency and throughput.

You don't need the Reader_Status table either, if you keep track of which log rows you've processed on a row-by-row basis.

Here's what I'd do: add a new column to your log table. Call it "processed", for example. Make it a boolean defaulting to false (or a small integer defaulting to 0, or whatever). The Writers use the default value when they insert.
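
Oracle SQL traditionally has no boolean column type, so a small-integer flag is the usual substitute. A sketch of the DDL (the index name is illustrative; the index serves the reader's "unprocessed, lowest ids first" query below):

ALTER TABLE transaction_log
  ADD (processed NUMBER(1) DEFAULT 0 NOT NULL);

CREATE INDEX transaction_log_unread_ix
  ON transaction_log (processed, id);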

When the Reader queries for the next block of records to process, he queries for rows where processed is false, taking the lowest id values first.

SELECT * FROM (
    SELECT * FROM Transaction_Log
    WHERE processed = 0
    ORDER BY id
)
WHERE ROWNUM <= 10;  -- Oracle's way to cap the block at 10 rows (no LIMIT)

As he processes them, the Reader uses UPDATE to change processed from false to true. So the next time the Reader queries for a block of records, he is sure he won't get rows he has already processed.

UPDATE Transaction_Log
SET processed = 1
WHERE id = ?;  -- do this for each row processed

This UPDATE shouldn't conflict with the INSERT operations done by the Writers.

If any rows are committed out of sequence by other Writers, the Reader will see them the next time he queries, as long as he always processes rows in order of the id column, from lowest value to highest.

Bill Karwin
A: 

Since you know the last_id processed by the Reader, you can request the next work item in this manner:

select * from Transaction_log where id = (
  select last_id + 1 /* or whatever increment your sequencer has */
    from Reader_status)
Constantin
A: 

I agree with AJ's solution (link). Additionally, the following suggestions may help to reduce the number of holes.

1) Use an Oracle sequence to create the id, "auto increment" style, as follows:

INSERT INTO transaction_table VALUES (id_seq.NEXTVAL, ...);  -- other column values elided

2) Use auto-commit (e.g. JDBC's setAutoCommit(true)) so that each insert commits immediately.

These two steps will reduce the number of holes substantially. Still, there is a possibility that an insert started first but committed later, with a read operation happening in between.

Rejeev Divakaran