I'm working on a program to automatically find optimal shift assignments, subject to lots of constraints. I'm using Grails, i.e. the data about workers, shifts and assignments will be kept in a DBMS.

For the optimization itself, I'll have to work very intensively on a small subset of the data (about 600 rows total from about 5 different tables). I'll have to iterate over and search through various sub-subsets dozens of times to compute fitness functions, change some values, compute fitness again, lather, rinse, repeat, perhaps hundreds of times.

Now, while searching and iteration are exactly what a DBMS is for, I believe that in this case the overhead of hundreds of DB requests would dwarf the actual work being done, even for an in-memory DBMS like HSQLDB. So instead, I'm planning to slurp the entire subset into memory at the beginning, build my own indexes (mainly HashMaps) for the lookups I'll have to do, and then work only with those, staying away from the DB until I'm done and can write my results back to it.
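
Roughly what I have in mind, as a sketch (Worker, Assignment and the constraint check are just placeholders for my actual domain classes):

    // Sketch only: Worker and Assignment stand in for the real domain classes.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    record Worker(long id) {}
    record Assignment(long shiftId, long workerId) {}

    class InMemorySchedule {
        // Hand-built indexes over the ~600 rows, loaded once up front.
        private final Map<Long, Worker> workersById = new HashMap<>();
        private final Map<Long, List<Assignment>> assignmentsByShift = new HashMap<>();

        void load(List<Worker> workers, List<Assignment> assignments) {
            for (Worker w : workers)
                workersById.put(w.id(), w);
            for (Assignment a : assignments)
                assignmentsByShift.computeIfAbsent(a.shiftId(), k -> new ArrayList<>()).add(a);
        }

        // Every lookup during a fitness computation hits the maps, never the DB.
        double fitness(long shiftId) {
            double penalty = 0;
            for (Assignment a : assignmentsByShift.getOrDefault(shiftId, List.of())) {
                Worker w = workersById.get(a.workerId());
                penalty += constraintPenalty(w, a); // stand-in for the real constraint checks
            }
            return penalty;
        }

        private double constraintPenalty(Worker w, Assignment a) { return 0; }
    }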

Is this a sound approach? Any better ideas?

+1  A: 

I'm assuming you must issue hundreds of commands to the database? There's no way to execute the code inside the DB?

The main thing I'd be worried about is integrity; make sure you handle locking correctly. You'd probably want a version number stored somewhere so you don't need to lock the entire set of data for the duration of processing. In the update transaction, you'd first ensure the version number is the same as when you started reading.
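
In plain JDBC the check might look something like this (the schedule table, its version column and the method name are made up, just to show the shape):

    // Optimistic concurrency sketch: fail the write if the version changed
    // while the optimization was running. Table/column names are illustrative.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class ScheduleWriter {
        void saveResult(Connection conn, long scheduleId, long versionReadAtStart,
                        String result) throws SQLException {
            String sql = "UPDATE schedule SET data = ?, version = version + 1 "
                       + "WHERE id = ? AND version = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, result);
                ps.setLong(2, scheduleId);
                ps.setLong(3, versionReadAtStart);
                if (ps.executeUpdate() == 0) {
                    // Row changed (or vanished) since we read it: retry or abort.
                    throw new SQLException("Concurrent modification detected");
                }
            }
        }
    }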

Finally, benchmark it. I've done some apps over the last year or so that had a similarly intensive compute process per request. Using in-process objects to represent the data was orders of magnitude more efficient than hitting the database for each operation. But every app is different, and there may be factors you haven't considered that'll impact it.
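
Even a crude timing harness will tell you where you stand, e.g. (both compute methods are stubs standing in for the real code):

    // Crude benchmark sketch: time N fitness evaluations against the in-memory
    // structures vs. N that go through the database.
    class FitnessBenchmark {
        static void run() {
            int n = 1000;
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) computeFitnessInMemory();
            long inMemory = System.nanoTime() - start;

            start = System.nanoTime();
            for (int i = 0; i < n; i++) computeFitnessViaDb();
            long viaDb = System.nanoTime() - start;

            System.out.printf("in-memory: %d ms, via DB: %d ms%n",
                    inMemory / 1_000_000, viaDb / 1_000_000);
        }

        static void computeFitnessInMemory() { /* hits only the HashMaps */ }
        static void computeFitnessViaDb()   { /* issues SQL per evaluation */ }
    }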

MichaelGG
I suppose I could theoretically do the entire optimization stuff in a stored procedure, but I'd rather not deal with PL/pgSQL and can't use a DB with Java stored procedures.
Michael Borgwardt
The locking is a good point, though I'm not expecting many concurrent users, especially not concurrent attempts to optimize the same schedule. I'll definitely use a profiler to find performance bottlenecks.
Michael Borgwardt