We have a PHP application which selects a row from the database, works on it (it calls an external API through a web service), and then inserts a new record based on the work done. There's an AJAX display which tells the user how many records have been processed.

The data is mostly text, so the rows are rather heavy.

The process handles thousands of records at a time. The user can choose how many records to start working on. The data is obtained from a single table, where processed rows are marked as "done". There is no WHERE condition, except the optional "WHERE date BETWEEN date1 AND date2".

We had an argument over which approach is better (both are sketched roughly below):

  • Select one record, work on it, and insert the new data
  • Select all of the records, work on them in memory, and insert the results into the database after all the work is done.
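
Roughly, in PHP with the pg_* functions, this is what I mean. It's only a sketch: the table names records and results, the done flag, the date column, the connection string, and the process_via_api() wrapper are made-up placeholders, not our actual schema.

    <?php
    // Sketch of the two approaches; all names here are placeholders.
    $db = pg_connect('host=localhost dbname=app user=app');

    // Approach 1: select one record, work on it, insert the new data, repeat.
    function one_at_a_time($db, $date1, $date2) {
        while (true) {
            $res = pg_query_params($db,
                'SELECT id, payload FROM records
                  WHERE done = false AND date BETWEEN $1 AND $2 LIMIT 1',
                array($date1, $date2));
            $row = pg_fetch_assoc($res);
            if ($row === false) {
                break;                                  // nothing left to process
            }
            $output = process_via_api($row['payload']); // external web-service call
            pg_query_params($db,
                'INSERT INTO results (record_id, data) VALUES ($1, $2)',
                array($row['id'], $output));
            pg_query_params($db,
                'UPDATE records SET done = true WHERE id = $1',
                array($row['id']));
        }
    }

    // Approach 2: select everything, work in memory, insert once all work is done.
    function all_at_once($db, $date1, $date2) {
        $res = pg_query_params($db,
            'SELECT id, payload FROM records
              WHERE done = false AND date BETWEEN $1 AND $2',
            array($date1, $date2));
        $rows = pg_fetch_all($res);
        if ($rows === false) {
            return;                                     // no rows matched
        }
        $outputs = array();
        foreach ($rows as $row) {
            $outputs[] = array($row['id'], process_via_api($row['payload']));
        }
        pg_query($db, 'BEGIN');
        foreach ($outputs as $out) {
            pg_query_params($db,
                'INSERT INTO results (record_id, data) VALUES ($1, $2)', $out);
            pg_query_params($db,
                'UPDATE records SET done = true WHERE id = $1', array($out[0]));
        }
        pg_query($db, 'COMMIT');
    }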

Which approach do you consider the most efficient one for a web environment with PHP and PostgreSQL? Why?

A: 

If the data is heavy and so is the load, and considering the application is not real-time dependent, the best approach is most definitely to get the needed data, work on all of it, and then put it back.

Efficiency-wise, regardless of language: if you are fetching single items and working on them individually, you are probably opening and closing the database connection each time. This means that if you have thousands of items, you will open and close thousands of connections. The overhead of that far outweighs the overhead of returning all of the items and working on them together.
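
As a rough illustration of that point only (the connection string, the records table, and the $ids list are placeholders, not the asker's schema):

    <?php
    // Placeholder list of record ids to work through.
    $ids = array(1, 2, 3);

    // Per-item connections: thousands of connect/close round trips.
    foreach ($ids as $id) {
        $db  = pg_connect('host=localhost dbname=app user=app');
        $res = pg_query_params($db,
            'SELECT payload FROM records WHERE id = $1', array($id));
        // ... work on the row, insert the result ...
        pg_close($db);
    }

    // One connection reused for the whole run (or pg_pconnect for a persistent one).
    $db = pg_connect('host=localhost dbname=app user=app');
    foreach ($ids as $id) {
        $res = pg_query_params($db,
            'SELECT payload FROM records WHERE id = $1', array($id));
        // ... work on the row, insert the result ...
    }
    pg_close($db);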

Kyle Rozendo
It did not sound like he was starting a new PHP script for each item. Even if he was, pconnect would offset that: http://us.php.net/manual/en/function.pg-pconnect.php
gahooa
+1  A: 

It really depends on how much you care about your data (seriously):

Does reliability matter in this case? If the process dies, can you just re-process everything? Or can't you?

Typically when calling a remote web service, you don't want to be calling it twice for the same data item. Perhaps there are side effects (like credit card charges), or maybe it is not a free API...

Anyway, if you don't care about potential duplicate processing, then take the batch approach. It's easy, simple, and fast.

But if you do care about duplicate processing, then do this:

  1. SELECT 1 record from the table FOR UPDATE (i.e. lock it in a transaction)
  2. UPDATE that record with a status of "Processing"
  3. Commit that transaction

And then

  1. Process the record
  2. Update the record contents, AND
  3. SET the status to "Complete", or "Error" in case of errors.

You can run this code concurrently without fear of it running over itself. You will be able to have confidence that the same record will not be processed twice.

You will also be able to see any records that "didn't make it", because their status will still be "Processing", as well as any that hit errors.
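
A minimal PHP sketch of that claim-then-process pattern, assuming a hypothetical records table with status and data columns, unclaimed rows starting at 'Pending', and a process_via_api() wrapper around the web service (all assumed names, not the asker's actual schema):

    <?php
    $db = pg_connect('host=localhost dbname=app user=app');

    // Steps 1-3: claim one record in a short transaction so concurrent workers
    // cannot grab the same row, then commit the "Processing" status right away.
    pg_query($db, 'BEGIN');
    $res = pg_query($db,
        "SELECT id, payload FROM records WHERE status = 'Pending' LIMIT 1 FOR UPDATE");
    $row = pg_fetch_assoc($res);
    if ($row === false) {
        pg_query($db, 'COMMIT');                      // nothing left to claim
        exit;
    }
    pg_query_params($db,
        "UPDATE records SET status = 'Processing' WHERE id = $1",
        array($row['id']));
    pg_query($db, 'COMMIT');

    // Then: do the slow work outside the transaction and record the outcome.
    try {
        $output = process_via_api($row['payload']);   // external web-service call
        pg_query_params($db,
            "UPDATE records SET data = $1, status = 'Complete' WHERE id = $2",
            array($output, $row['id']));
    } catch (Exception $e) {
        pg_query_params($db,
            "UPDATE records SET status = 'Error' WHERE id = $1",
            array($row['id']));
    }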

gahooa
I think the advantages of the batch approach are convincing enough. I already implemented it this way. It's a web app, though, so wouldn't loading all that data at once be pretty hard on the server's memory?
Fernando