ansaurus

Question

Want to convert process into a multi-threaded process

Answer 1

A:

Assuming you're using an external database that supports concurrency, this should be no problem. You may want to use row-level locking to prevent threads from working with the same data, as well as some algorithm that will help each thread pick different data to operate on.

pix0r 2010-01-05 17:22:30

Answer 2

+3 A:

SELECT TOP 1 * FROM Products WHERE isActive = 1

Would be a problem with this. Until the Top 1 is processed all threads would retrieve the same record. I would suggest changing this query to both select the product and update it to isActive=2 in the SAME sql call. This could be done with a stored procedure or anonymous begin/end block. Each call would than receive a unique record.

Other than that, you're golden.

Addendum: As suggested a timeout could be specified in case of failed or locked threads, this would allow another thread to pick up and process at a later time.

Chris Kannon 2010-01-05 17:30:02

put it in a transition state like isActive=2 so you know it is not processed yet, but a thread is working on it.

Peter Schuetze 2010-01-05 17:32:16

Peter, good idea, that would rule out false positives on failures.

Chris Kannon 2010-01-05 17:43:53

And add an expiration, after which is ceases to be active. This way, if the process that grabbed it dies, the product is eventually released for others to process.

Steven Sudit 2010-01-05 17:43:53

Answer 3

+3 A:

Since the slowest activity is html retrieval, this could be linearly sped up with 20, 50, or even 200 retrieval threads, depending on ISP bandwidth relative to the speed of the servers returning data.

It could be sensible to semi-virtualize the table to an in-memory array. So each thread looking for work would query a class member function which returns the next available row or handles updating it with being done. The class should also occasionally detects database updates if there are other updaters and flush in-memory updates back to the d/b every few seconds or minutes as makes sense.

I don't know Java, so here is an impressionistic algorithm in PHPish lingo:

class virtualProduct {

  const time_t maxSync = 10;  // maximum age for unsynched d/b to row[]

  static struct {  // singleton
     int isActive;
     int urlRowId;
     etc ...
  } row [];

  static time_t lastSync;  // timestamp of last sync with d/b

  static mutex theLock;  // mutex to protect read/write of above


  function syncData()
  {
     lock (&theLock);

     // flush local updates to d/b
     foreach (row as item)
         if (item.updated)
         {
               sql_exec ("update products set whatever = " + value + " where rowId = " + whatever);
               if (okay)
                    item .updated = false;
         }

     // update from d/b (needed if other entities are updating it)
     sql_query ("select * from products");

     row [] = sql results;
     lastSync = now();
     unlock (&theLock);
  }

  function virtualProduct ()  // constructor
  {
      ...
      syncData();  // initialize memory copy of d/b
  }

  function ~virtualProduct ()  // destructor
  {
      syncData();  // write last updates
      ...
  }

  function UpdateItem(int id)
  {
     lock (&theLock);
     if (now () - lastSync > maxSync)
         syncData();
     int index = row.find (id);
     if (index >= 0)
     {
         row [index] .fields = whatever;
         row [index] .isActive = 0;
     }
     unlock (&theLock);
  }

  function ObtainNextItem()
  {
     lock (&theLock);
     if (now () - lastSync > maxSync)
         syncData();
     result = null;
     foreach (row as item)
         if (item.isActive == 1)
         {
              item.isActive = 2;  // using Peter Schuetze's suggestion
              result = item.id;
              break;
         }
     unlock (&theLock);
     return result;
  }
}

There are still some minor wrinkles to fix like the double locking of the mutex in UpdateItem and ObtainNextItem (from calling into syncData), but that's readily fixed when translating to a real implementation.

wallyk 2010-01-05 17:48:27

wow, that sounds way faster and stable. but how?!

mrblah 2010-01-05 18:15:28

Okay, I've added the skeleton of the idea.

wallyk 2010-01-05 19:51:37

Answer 4

+1 A:

SELECT TOP 1 * FROM Products WHERE isActive = 1

likely results in table scan, and returning just one row a time in this case is wasting server's resources. It's better to return top 10 (or even more based on your application) a time and just do concurrent retrieval for the in-memory data, like wallyk pointed out.

Codism 2010-01-05 17:57:30

Answer 5

A:

Probably the easiest way would be to set up a thread pool and load a thread-safe queue with the rows.

Then, each "pool thread" would start by grabbing the top record on the queue (popping it off the queue, so other threads can't also grab the same record), download the HTML, and update the record.

KingRadical 2010-01-05 18:57:52

ansaurus

tags:

views:

answers:

Want to convert process into a multi-threaded process

related questions