Since the slowest activity is HTML retrieval, this could be sped up roughly linearly with 20, 50, or even 200 retrieval threads, depending on your ISP bandwidth relative to the speed of the servers returning data.
It could be sensible to semi-virtualize the table into an in-memory array. Each thread looking for work would then call a class member function that either returns the next available row or marks a row as done. The class should also occasionally detect database updates, if there are other updaters, and flush in-memory updates back to the d/b every few seconds or minutes, as makes sense.
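As a rough illustration of the retrieval-thread idea, here is a minimal Java sketch using a fixed-size thread pool. The class name `ParallelFetch` and the `fetcher` parameter are made up for this example; the function that actually downloads a page is passed in, so nothing here assumes a particular HTTP library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: downloads many pages in parallel.
final class ParallelFetch {
    // Applies `fetcher` to every URL using `threads` worker threads and
    // returns the results in the same order as the input list.
    static List<String> fetchAll(List<String> urls, int threads,
                                 Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                pending.add(pool.submit(task));
            }
            List<String> pages = new ArrayList<>();
            for (Future<String> f : pending) {
                pages.add(f.get()); // blocks until that download finishes
            }
            return pages;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("fetch failed", e);
        } finally {
            pool.shutdown(); // stop accepting work; idle threads exit
        }
    }
}
```

The thread count is the knob to experiment with: start low, raise it, and stop when throughput no longer improves.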
I don't know Java, so here is an impressionistic algorithm in PHPish lingo:

    class virtualProduct {
        const time_t maxSync = 10;  // maximum age for unsynched d/b to row[]

        static struct {             // singleton
            int isActive;
            int urlRowId;
            etc ...
        } row [];
        static time_t lastSync;     // timestamp of last sync with d/b
        static mutex theLock;       // mutex to protect read/write of above

        function syncData()
        {
            lock (&theLock);

            // flush local updates to d/b
            foreach (row as item)
                if (item.updated)
                {
                    sql_exec ("update products set whatever = " + value + " where rowId = " + whatever);
                    if (okay)
                        item.updated = false;
                }

            // update from d/b (needed if other entities are updating it)
            sql_query ("select * from products");
            row [] = sql results;
            lastSync = now();
            unlock (&theLock);
        }

        function virtualProduct ()   // constructor
        {
            ...
            syncData();   // initialize memory copy of d/b
        }

        function ~virtualProduct ()  // destructor
        {
            syncData();   // write last updates
            ...
        }

        function UpdateItem(int id)
        {
            lock (&theLock);
            if (now () - lastSync > maxSync)
                syncData();
            int index = row.find (id);
            if (index >= 0)
            {
                row [index].fields = whatever;
                row [index].isActive = 0;
                row [index].updated = true;  // mark for the next flush in syncData
            }
            unlock (&theLock);
        }

        function ObtainNextItem()
        {
            lock (&theLock);
            if (now () - lastSync > maxSync)
                syncData();
            result = null;
            foreach (row as item)
                if (item.isActive == 1)
                {
                    item.isActive = 2;    // using Peter Schuetze's suggestion
                    item.updated = true;  // so the claim survives the next reload
                    result = item.id;
                    break;
                }
            unlock (&theLock);
            return result;
        }
    }

There are still some minor wrinkles to fix, like the double locking of the mutex in UpdateItem and ObtainNextItem (from calling into syncData), but that's readily fixed when translating to a real implementation.
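For what it's worth, the double-locking wrinkle may simply go away in Java, because `synchronized` methods use reentrant locks: a thread that already holds the lock can call another `synchronized` method on the same object without deadlocking. The sketch below is a loose rendering of the pseudocode under several assumptions: `ProductStore` is a hypothetical stand-in for the real database access, only the `isActive` field is modeled, and claimed rows are marked as updated so that a reload does not hand them out a second time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the database; a real version would issue SQL.
interface ProductStore {
    Map<Integer, Integer> loadAll();       // rowId -> isActive
    void save(int rowId, int isActive);
}

final class VirtualProduct {
    private static final long MAX_SYNC_MILLIS = 10_000; // max age of the copy

    private static final class Row {
        int isActive;
        boolean updated;                   // true if not yet written back
        Row(int isActive) { this.isActive = isActive; }
    }

    private final ProductStore store;
    private final Map<Integer, Row> rows = new LinkedHashMap<>();
    private long lastSync;

    VirtualProduct(ProductStore store) {
        this.store = store;
        syncData();                        // initialize the in-memory copy
    }

    // Flushes local changes, then reloads the table from the store.
    synchronized void syncData() {
        for (Map.Entry<Integer, Row> e : rows.entrySet()) {
            Row r = e.getValue();
            if (r.updated) {
                store.save(e.getKey(), r.isActive);
                r.updated = false;
            }
        }
        rows.clear();
        for (Map.Entry<Integer, Integer> e : store.loadAll().entrySet()) {
            rows.put(e.getKey(), new Row(e.getValue()));
        }
        lastSync = System.currentTimeMillis();
    }

    // Caller already holds the lock; re-entering syncData() is safe.
    private void syncIfStale() {
        if (System.currentTimeMillis() - lastSync > MAX_SYNC_MILLIS) {
            syncData();
        }
    }

    // Marks a row as finished (isActive = 0).
    synchronized void updateItem(int id) {
        syncIfStale();
        Row r = rows.get(id);
        if (r != null) {
            r.isActive = 0;
            r.updated = true;
        }
    }

    // Claims the next available row, or returns null if none is left.
    synchronized Integer obtainNextItem() {
        syncIfStale();
        for (Map.Entry<Integer, Row> e : rows.entrySet()) {
            Row r = e.getValue();
            if (r.isActive == 1) {
                r.isActive = 2;            // in progress
                r.updated = true;
                return e.getKey();
            }
        }
        return null;
    }
}
```

Each worker thread would loop on `obtainNextItem()`, fetch the page, and call `updateItem(id)` when done; a final `syncData()` at shutdown plays the role of the destructor in the pseudocode.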