views:

146

answers:

5

Let me go over what my code is doing (haven't coded it yet), I want to make it run in multiple threads to speed things up.

  1. looks for a record in the database that hasn't been processed yet:

    SELECT TOP 1 * FROM Products WHERE isActive = 1
    
  2. looks up a URL (Rest call), returns back the HTML and stores it in the database

  3. Sets the flag for that row:

    UPDATE Products SET isActive = 0 WHERE rowId = 234
    

So say I wrap the above into a method:

public void UpdateProduct()
{
}

Is it possible to make this process run in multiple threads? (say 2 or 3)?

Update Can someone show me a skeleton structure of how I would wrap this method into a multi-threaded process?

A: 

Assuming you're using an external database that supports concurrency, this should be no problem. You may want to use row-level locking to prevent threads from working with the same data, as well as some algorithm that will help each thread pick different data to operate on.

pix0r
+3  A: 
SELECT TOP 1 * FROM Products WHERE isActive = 1 

Would be a problem with this. Until the Top 1 is processed all threads would retrieve the same record. I would suggest changing this query to both select the product and update it to isActive=2 in the SAME sql call. This could be done with a stored procedure or anonymous begin/end block. Each call would than receive a unique record.

Other than that, you're golden.

Addendum: As suggested a timeout could be specified in case of failed or locked threads, this would allow another thread to pick up and process at a later time.

Chris Kannon
put it in a transition state like isActive=2 so you know it is not processed yet, but a thread is working on it.
Peter Schuetze
Peter, good idea, that would rule out false positives on failures.
Chris Kannon
And add an expiration, after which is ceases to be active. This way, if the process that grabbed it dies, the product is eventually released for others to process.
Steven Sudit
+3  A: 

Since the slowest activity is html retrieval, this could be linearly sped up with 20, 50, or even 200 retrieval threads, depending on ISP bandwidth relative to the speed of the servers returning data.

It could be sensible to semi-virtualize the table to an in-memory array. So each thread looking for work would query a class member function which returns the next available row or handles updating it with being done. The class should also occasionally detects database updates if there are other updaters and flush in-memory updates back to the d/b every few seconds or minutes as makes sense.

I don't know Java, so here is an impressionistic algorithm in PHPish lingo:

class virtualProduct {

  const time_t maxSync = 10;  // maximum age for unsynched d/b to row[]

  static struct {  // singleton
     int isActive;
     int urlRowId;
     etc ...
  } row [];

  static time_t lastSync;  // timestamp of last sync with d/b

  static mutex theLock;  // mutex to protect read/write of above


  function syncData()
  {
     lock (&theLock);

     // flush local updates to d/b
     foreach (row as item)
         if (item.updated)
         {
               sql_exec ("update products set whatever = " + value + " where rowId = " + whatever);
               if (okay)
                    item .updated = false;
         }

     // update from d/b (needed if other entities are updating it)
     sql_query ("select * from products");

     row [] = sql results;
     lastSync = now();
     unlock (&theLock);
  }

  function virtualProduct ()  // constructor
  {
      ...
      syncData();  // initialize memory copy of d/b
  }

  function ~virtualProduct ()  // destructor
  {
      syncData();  // write last updates
      ...
  }

  function UpdateItem(int id)
  {
     lock (&theLock);
     if (now () - lastSync > maxSync)
         syncData();
     int index = row.find (id);
     if (index >= 0)
     {
         row [index] .fields = whatever;
         row [index] .isActive = 0;
     }
     unlock (&theLock);
  }

  function ObtainNextItem()
  {
     lock (&theLock);
     if (now () - lastSync > maxSync)
         syncData();
     result = null;
     foreach (row as item)
         if (item.isActive == 1)
         {
              item.isActive = 2;  // using Peter Schuetze's suggestion
              result = item.id;
              break;
         }
     unlock (&theLock);
     return result;
  }
}

There are still some minor wrinkles to fix like the double locking of the mutex in UpdateItem and ObtainNextItem (from calling into syncData), but that's readily fixed when translating to a real implementation.

wallyk
wow, that sounds way faster and stable. but how?!
mrblah
Okay, I've added the skeleton of the idea.
wallyk
+1  A: 
SELECT TOP 1 * FROM Products WHERE isActive = 1 

likely results in table scan, and returning just one row a time in this case is wasting server's resources. It's better to return top 10 (or even more based on your application) a time and just do concurrent retrieval for the in-memory data, like wallyk pointed out.

Codism
A: 

Probably the easiest way would be to set up a thread pool and load a thread-safe queue with the rows.

Then, each "pool thread" would start by grabbing the top record on the queue (popping it off the queue, so other threads can't also grab the same record), download the HTML, and update the record.

KingRadical