Is it more performant to use a Prepared Statement with one question mark in it fifty times, or to use a Prepared Statement with fifty question marks in it once?

Essentially, is Where Person = ? or Where Person IN (?, ?, ?, ...) better?

Example

Say you have a table with a column, country, and then a few relational tables away you have the population for that country.

Given a list of 1000 countries, what is the best way to go about getting the population?

Keep in mind this is a hypothetical example; Wikipedia puts the number of countries at 223, but let's assume for this example it is much larger.

  1. Create a statement that takes in a country parameter and returns a population. Example: Where Country = ?

  2. Create a Prepared Statement dynamically, adding a ? for each country using a Where In (?, ?, etc.) clause. Example: Where Country IN (?, ?, ...)

  3. Create a simple statement like in option one, but loop through and reuse the one parameter Prepared Statement for each country.

What is the preferable method?
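
For reference, here is a rough JDBC sketch of what options 3 and 2 look like from the client side. The table and column names (country_population, country, population) are made up purely for illustration:

import java.sql.*;
import java.util.*;

public class PopulationLookup {

    // Option 3: prepare a single-parameter statement once and reuse it in a loop.
    static Map<String, Long> byLoop(Connection conn, List<String> countries) throws SQLException {
        Map<String, Long> result = new HashMap<>();
        String sql = "SELECT population FROM country_population WHERE country = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String country : countries) {
                ps.setString(1, country);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        result.put(country, rs.getLong(1));
                    }
                }
            }
        }
        return result;
    }

    // Option 2: build one statement with a ? placeholder per country (empty list not handled).
    static Map<String, Long> byInClause(Connection conn, List<String> countries) throws SQLException {
        Map<String, Long> result = new HashMap<>();
        StringBuilder sql = new StringBuilder(
            "SELECT country, population FROM country_population WHERE country IN (");
        for (int i = 0; i < countries.size(); i++) {
            sql.append(i == 0 ? "?" : ", ?");
        }
        sql.append(")");
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            for (int i = 0; i < countries.size(); i++) {
                ps.setString(i + 1, countries.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.put(rs.getString(1), rs.getLong(2));
                }
            }
        }
        return result;
    }
}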

A: 

As is often said, "That depends". If you're just looking for the population of a single country I'd go with method 1. I'd avoid #2 because I don't like to use dynamically-constructed SQL unless it's the only way to get the job done (efficiently), and this doesn't appear to be one of those cases. I'm not big on #3 either because I think that the loop will be inefficient if you need to fetch the population of all the different countries.

How about we add #4: a single statement that returns the population of all the countries, something like

SELECT C.COUNTRY_NAME, SUM(S.POPULATION)
  FROM COUNTRY C,
       COUNTRY_CENSUS_SUBDIVISION S
  WHERE S.ID_COUNTRY = C.ID_COUNTRY
  GROUP BY C.COUNTRY_NAME;

Build a method around that and have it return a Map of country to population if you need to obtain the population of all the countries at once.
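
A minimal JDBC sketch of such a method, assuming the schema used in the query above, might look like:

import java.sql.*;
import java.util.HashMap;
import java.util.Map;

public class AllPopulations {

    // Run the aggregate query once and return a Map of country name to population.
    static Map<String, Long> loadAll(Connection conn) throws SQLException {
        String sql =
            "SELECT C.COUNTRY_NAME, SUM(S.POPULATION) " +
            "  FROM COUNTRY C, COUNTRY_CENSUS_SUBDIVISION S " +
            " WHERE S.ID_COUNTRY = C.ID_COUNTRY " +
            " GROUP BY C.COUNTRY_NAME";
        Map<String, Long> populations = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                populations.put(rs.getString(1), rs.getLong(2));
            }
        }
        return populations;
    }
}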

Share and enjoy.

Bob Jarvis
Countries may be a poor example here, as there is a finite number of countries in the world. Let's assume, just for my example, that there are hundreds of thousands of countries and that the number is increasing. Returning a list of them all just isn't viable from a performance perspective.
James McMahon
A: 

RAM is cheap. Load the whole list into a cached hash table and work at memory speed.

If performance is an issue, use RAM. You could spend days or weeks trying to optimise something that could fit into $100 worth of RAM.
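
A rough sketch of that idea, with hypothetical table and column names: read the table once into a hash map and serve every subsequent lookup from memory.

import java.sql.*;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PopulationCache {

    private final Map<String, Long> cache = new ConcurrentHashMap<>();

    // Load the whole table into memory once.
    public void load(Connection conn) throws SQLException {
        String sql = "SELECT country, population FROM country_population";
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                cache.put(rs.getString(1), rs.getLong(2));
            }
        }
    }

    // Subsequent lookups never touch the database.
    public Long populationOf(String country) {
        return cache.get(country);
    }
}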

TFD
So that is option 2, right? So you are saying it is better to make use of the local cache (RAM) than to hit the database multiple times.
James McMahon
@James McMahon: Sort of; just read the WHOLE table into cache, so no IN clause should be required. Then use in-memory hash table/dictionary indexes to find items.
TFD
@TFD: My example up above seems to be tripping people up. Countries was a bad example apparently just because the number of countries is so small. Trust me when I say that I do not have enough memory to load this database into memory, nor is it reasonable to expect to be able to.
James McMahon
You can at least cache the most common tables/columns. $1000 of RAM can hold a large amount of anything.
TFD
A: 

There are two steps in executing a query:
1. Create the execution plan.
2. Execute the plan.

Prepared statements are related to step 1. In the example given I think that most of the execution time will be in step 2, so I'd pick the alternative that gives the best execution. A general rule for enabling the DB engine to optimize is to give it set-based queries rather than looping in the client and issuing several small queries. Available indexes and client-server latency of course affect how large the difference is, but I think that your option #2, creating a prepared statement dynamically, is often the best alternative.

Have you done any tests of the different alternatives? If you have, what do they show?

Anders Abel
A: 

As others have stated, it depends on the number of parameters and the size of the data. From what you have stated in the comments, the source table could be something that has hundreds of thousands of rows. If that's the case, the question comes down to the number of allowed filtering inputs. Is your query only going to allow for a small set of inputs, or does it need to allow filtering on a thousand countries? If the latter, then I'd recommend storing the selections in an intermediate table and joining off that. Something like:

Create Table CriteriaSelections
(
    SessionOrUsername nvarchar(50) not null
    , Country nvarchar(50) not null
)

On selection, you would populate this table and then query from it like so:

Select ...
From BigFatCountryTable
    Join CriteriaSelections
        On CriteriaSelections.Country = BigFatCountryTable.Country
            And CriteriaSelections.SessionOrUsername = @SessionOrUsername

You can use the RNGCryptoServiceProvider to generate a random number if this might be called multiple times in different ways by the same "session" in parallel. The catch to this setup is that you need to clear out the selections table periodically.
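
From the calling side (a JDBC sketch here), populating and querying that table might look like the following; the selected columns, the BigFatCountryTable layout, and the sessionKey handling are assumptions for illustration:

import java.sql.*;
import java.util.List;

public class SelectionTableExample {

    static void queryBySelections(Connection conn, String sessionKey, List<String> countries)
            throws SQLException {
        // Batch-insert this session's selections.
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO CriteriaSelections (SessionOrUsername, Country) VALUES (?, ?)")) {
            for (String country : countries) {
                ps.setString(1, sessionKey);
                ps.setString(2, country);
                ps.addBatch();
            }
            ps.executeBatch();
        }

        // Join against the selections for this session only.
        String select =
            "SELECT b.Country, b.Population " +
            "  FROM BigFatCountryTable b " +
            "  JOIN CriteriaSelections s " +
            "    ON s.Country = b.Country " +
            "   AND s.SessionOrUsername = ?";
        try (PreparedStatement ps = conn.prepareStatement(select)) {
            ps.setString(1, sessionKey);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // consume rows
                }
            }
        }

        // Clear this session's rows so the selections table does not grow unbounded.
        try (PreparedStatement ps = conn.prepareStatement(
                "DELETE FROM CriteriaSelections WHERE SessionOrUsername = ?")) {
            ps.setString(1, sessionKey);
            ps.executeUpdate();
        }
    }
}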

If the entities in question are somewhat immutable (e.g. a Country, a City etc.) then using a caching strategy in conjunction with your querying strategy would also help.

BTW, another solution along the same lines is to use a temp table. However, if you do that you need to be careful to use the exact same connection for creation of the temp table, the population of the temp table and its use.

Thomas
This sounds like a good suggestion; unfortunately, on the machine I am pulling the data from I have limited access. I definitely wouldn't be able to create a table, and I am unsure about using a temp table. Let's say I could create a temp table; how would you recommend getting the initial values into the table?
James McMahon
Getting the values into the temp table is easy. That's just doing a bunch of inserts like you would into a regular table. The tricky part is maintaining the same connection. So in one call you would do "Create Table #Foo...", then another call on the same connection would do a bunch of "Insert #Foo...", then lastly, on the same connection, you would run your select query where it would join to #Foo.
Thomas
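
For what it's worth, a JDBC sketch of that same-connection flow (the #Foo syntax follows the comment and is SQL-Server-style; DB2 and other engines declare temporary tables differently, and the column names are assumptions):

import java.sql.*;
import java.util.List;

public class TempTableExample {

    static void tempTableFlow(Connection conn, List<String> countries) throws SQLException {
        // 1. Create the temp table on this connection.
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE #Foo (Country nvarchar(50) not null)");
        }

        // 2. Populate it with a batch of inserts on the same connection.
        try (PreparedStatement ps = conn.prepareStatement("INSERT INTO #Foo (Country) VALUES (?)")) {
            for (String country : countries) {
                ps.setString(1, country);
                ps.addBatch();
            }
            ps.executeBatch();
        }

        // 3. Join against it, still on the same connection, or #Foo will not be visible.
        try (PreparedStatement ps = conn.prepareStatement(
                 "SELECT b.Country, b.Population FROM BigFatCountryTable b JOIN #Foo f ON f.Country = b.Country");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // consume rows
            }
        }
    }
}
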
Btw, another way of solving this problem is to pass a delimited list into a stored proc and then split that list into a temp table using a split function.
Thomas
A: 

Depending on the database engine used, there might be another alternative.

For MS SQL, for example, you could use a CSV->Table function such as: http://www.nigelrivett.net/SQLTsql/ParseCSVString.html

Then you can provide your query with a comma-separated string of values instead and join the table:

SELECT ..
FROM  table t
INNER JOIN dbo.fn_ParseCSVString(?, ',') x
     ON  x.s = t.id
WHERE ...

In this case, there will be two loops: building the CSV string (if you do not already have it in this format) and splitting the CSV into a table.
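
For example, from the client the first of those loops can be as simple as joining the values and binding the result as a single parameter; a sketch (the table name is hypothetical, and values containing commas are not escaped):

import java.sql.*;
import java.util.List;

public class CsvParameterExample {

    static void queryByCsv(Connection conn, List<String> ids) throws SQLException {
        // Build the comma-separated string of values.
        String csv = String.join(",", ids);
        String sql =
            "SELECT t.* " +
            "  FROM some_table t " +
            " INNER JOIN dbo.fn_ParseCSVString(?, ',') x ON x.s = t.id";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            // The whole CSV is bound as one parameter; the split happens in the database.
            ps.setString(1, csv);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // consume rows
                }
            }
        }
    }
}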

But it may provide better performance than executing the query several times, and better than using IN (which in my experience has pretty bad performance). If performance is really an issue, you should of course test.

Results may also vary depending on network overhead, etc...

Brimstedt
+2  A: 

I reached a point in my project where I was able to test with some real data.

Based on 1435 items, Option 1 takes ~8 minutes, Option 2 takes ~15 seconds, and Option 3 takes ~3 minutes.

Option 2 is the clear winner in terms of performance. It is a little harder to code around, but the performance difference is too great to ignore.

It makes sense that going back and forth to the database is the bottleneck, though I'm sure the results listed here would vary based on network, database engine, database machine specs, and other environmental factors.

James McMahon
Take into account that the number of elements in an IN clause is limited. In older versions of Oracle it was 255, if I remember well. Perhaps the same applies in other DBMSs.
Lluis Martinez
@Lluis Martinez, I was going to edit the answer to say the same thing. I agree that it would be important to keep in mind the limit on the number of parameters for a Prepared Statement, but on most modern databases I think the limit is fairly high. Just for reference, the database I am using in this particular project is DB2.
James McMahon