views:

103

answers:

3

Let's presume that you are writing an application for a retail store chain. So, you would design your object model such that you would define 'Store' as the core business object and lots of supporting objects. Let's say 'Store' looks like follows:

class Store implements Validatable{
int storeNo;
int storeName;
... etc.... 
}

So, your client tells you that you have to import store schedule from a excel sheet into the application and you would have to run a series of validations on 'em. For instance, 'StoreIsInSameCountry';'StoreIsValid'... etc. So, you would design a Rule interface for checking all business conditions. Something like this:

interface Rule T extends Validatable> {
public Error check(T value) throws Exception;
}

Now, here comes the question. I am uploading 2000 stores from this excel sheet. So, I would end up running each rule defined for a store that many times. If I were to have 4 rules = 8000 queries to the database, i.e, 16000 hits to the connection pool. For a simple check where I would just have to check whether the store exists or not, the query would be:

 SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID = ?

That way I would obtain get my 'Store' object. When I don't get anything from the database, then that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.

Alternatively, I could just do:

SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID in (1,2,3..... )

This query would actually return much faster than doing the one above it 2000 times. However, it doesn't go well with the design that a Rule can be run for a single store only.

I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, coz it gives better performance in this scenario? Or should I change my design?

What would you do if you were in my shoes, and what is the best practice?

A: 

I think it's more of a business question with parameter of how often does the client run the import, how long would it take for you to implement either of the solution, and how expensive is your time per hour.

If it's something that runs once in a while, a bit of bad performance is acceptable in my opinion, especially if you can get the job done quick using clean code.

eed3si9n
+1  A: 
SELECT store_id FROM store WHERE store_active = 1

or even

SELECT store_id FROM store

will tell you all the active stores in a single query. You can now conduct the other tests on stores you know to exist, and you've saved yourself 1,999 hits to the database.

If you've got relatively uncontested database access, and no time constraint on how long the whole thing is going to take then you've no real need to worry about hitting the connection pool over and over again. That's what it's designed for, after all!

banjollity
for doing what you are saying, it would make sense to do it if I were to cache that result set somewhere. I have to validate one store at a time, even with the above mentioned approach, and not caching the result set, I dont see how I will save myself 1999 hits to the database.
Jay
Run the query and store the results in a List<Integer>. That's you first test; does the size of the list match the expected number of stores. Then you can iterate the list running the other tests on the stores using each Integer as the store id for any JDBC queries.
banjollity
You can use the query above inside the IN (select ...) as a subquery. No caching used at all.
txwikinger
+2  A: 

That way I would obtain get my 'Store' object from the database. When I don't get anything from the database, then that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.

This is what you should not do.

Create a temporary table, fill the table with your values and JOIN this table, like this:

SELECT  STORE_ATTRIB1, STORE_ATTRIB2...
FROM    temptable tt
JOIN    STORE s
ON      s.STORE_ID = t.id

or this:

SELECT  STORE_ATTRIB1, STORE_ATTRIB2...
FROM    STORE s
WHERE   s.STORE_ID IN
        (
        SELECT  id
        FROM    temptable tt
        )

I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, coz it gives better performance in this scenario? Or should I change my design?

IN filters duplicates out.

If you want each eligible row to be selected for each duplicate value in the list, use JOIN.

IN is in no way a "not suggested methology".

In fact, there was a time when some databases did not support IN queries effciently, that's why folk wisdom still advices against using it.

But if your store_id is indexed properly (and it most probably is, if it's a PRIMARY KEY which it looks like), then all modern versions of major databases (that is Oracle, SQL Server, MySQL and PostgreSQL) will use an efficient plan to perform this query.

See this article in my blog for performance details in SQL Server:

Note, that in a properly designed database, validation rules are also set-based.

I. e. you implement your validation rules as queries against the temptable.

However, to support legacy rules, you can select values from temptable row-by-agonizing-row, apply the rules, and delete values which did not pass validation.

Quassnoi