Is it more performant to use a Prepared Statement with one question mark in it fifty times, or to use a Prepared Statement with fifty question marks in it once?

Essentially, is Where Person = ? or Where Person IN (?, ?, ?, ...) better?

Example

Say you have a table with a column, country, and then a few relational tables away you have the population for that country.

Given a list of 1000 countries, what is the best way to go about getting the population?

Keep in mind this is a hypothetical example; Wikipedia puts the number of countries at 223, but let's assume for this example it is much larger.

  1. Create a statement that takes in a country parameter and returns a population. Example: Where Country = ?

  2. Create a Prepared Statement dynamically, adding a ? for each country using a Where In (?, ?, etc.) clause. Example: Where Country IN (?, ?, ...)

  3. Create a simple statement like in option one, but loop through and reuse the one parameter Prepared Statement for each country.

What is the preferable method?
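
For reference, here is a rough JDBC sketch of what options 3 and 2 look like from the client side. The table and column names (country_population, country, population) are made up purely for illustration:

import java.sql.*;
import java.util.*;

public class PopulationLookup {

    // Option 3: prepare a single-parameter statement once and reuse it in a loop.
    static Map<String, Long> byLoop(Connection conn, List<String> countries) throws SQLException {
        Map<String, Long> result = new HashMap<>();
        String sql = "SELECT population FROM country_population WHERE country = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String country : countries) {
                ps.setString(1, country);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        result.put(country, rs.getLong(1));
                    }
                }
            }
        }
        return result;
    }

    // Option 2: build one statement with a ? placeholder per country (empty list not handled).
    static Map<String, Long> byInClause(Connection conn, List<String> countries) throws SQLException {
        Map<String, Long> result = new HashMap<>();
        StringBuilder sql = new StringBuilder(
            "SELECT country, population FROM country_population WHERE country IN (");
        for (int i = 0; i < countries.size(); i++) {
            sql.append(i == 0 ? "?" : ", ?");
        }
        sql.append(")");
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            for (int i = 0; i < countries.size(); i++) {
                ps.setString(i + 1, countries.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.put(rs.getString(1), rs.getLong(2));
                }
            }
        }
        return result;
    }
}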

A: 

As is often said, "That depends". If you're just looking for the population of a single country I'd go with method 1. I'd avoid #2 because I don't like to use dynamically-constructed SQL unless it's the only way to get the job done (efficiently), and this doesn't appear to be one of those cases. I'm not big on #3 either because I think that the loop will be inefficient if you need to fetch the population of all the different countries.

How about we add #4: a single statement that returns the population of all the countries, something like

SELECT C.COUNTRY_NAME, SUM(S.POPULATION)
  FROM COUNTRY C,
       COUNTRY_CENSUS_SUBDIVISION S
  WHERE S.ID_COUNTRY = C.ID_COUNTRY
  GROUP BY C.COUNTRY_NAME;

Build a method around that and have it return a Map of country to population if you need to obtain the population of all the countries at once.
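
A minimal JDBC sketch of such a method, assuming the schema used in the query above, might look like:

import java.sql.*;
import java.util.HashMap;
import java.util.Map;

public class AllPopulations {

    // Run the aggregate query once and return a Map of country name to population.
    static Map<String, Long> loadAll(Connection conn) throws SQLException {
        String sql =
            "SELECT C.COUNTRY_NAME, SUM(S.POPULATION) " +
            "  FROM COUNTRY C, COUNTRY_CENSUS_SUBDIVISION S " +
            " WHERE S.ID_COUNTRY = C.ID_COUNTRY " +
            " GROUP BY C.COUNTRY_NAME";
        Map<String, Long> populations = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                populations.put(rs.getString(1), rs.getLong(2));
            }
        }
        return populations;
    }
}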

Share and enjoy.

Bob Jarvis
Countries may be a poor example here, as there is a finite number of countries in the world. Let's assume, just for my example, that there are hundreds of thousands of countries and that the number is increasing. Returning a list of them all just isn't viable from a performance perspective.
James McMahon
A: 

RAM is cheap. Load the whole list into a cached hash table and work at memory speed.

If performance is an issue, use RAM. You could spend days or weeks trying to optimise something that could fit into $100 worth of RAM.
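
A rough sketch of that idea, with hypothetical table and column names: read the table once into a hash map and serve every subsequent lookup from memory.

import java.sql.*;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PopulationCache {

    private final Map<String, Long> cache = new ConcurrentHashMap<>();

    // Load the whole table into memory once.
    public void load(Connection conn) throws SQLException {
        String sql = "SELECT country, population FROM country_population";
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                cache.put(rs.getString(1), rs.getLong(2));
            }
        }
    }

    // Subsequent lookups never touch the database.
    public Long populationOf(String country) {
        return cache.get(country);
    }
}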

TFD
So that is option 2, right? So you are saying it is better to make use of the local cache (RAM) than to hit the database multiple times.
James McMahon
@James McMahon: Sort of; just read the WHOLE table into cache, so no IN clause should be required. Then use in-memory hash table/dictionary indexes to find items.
TFD
@TFD: My example up above seems to be tripping people up. Countries was a bad example apparently just because the number of countries is so small. Trust me when I say that I do not have enough memory to load this database into memory, nor is it reasonable to expect to be able to.
James McMahon
You can at least cache the most common tables/columns. $1000 of RAM can hold a large amount of anything.
TFD
A: 

There are two steps in executing a query:
1. Create the execution plan.
2. Execute the plan.

Prepared statements are related to step 1. In the example given I think that most of the execution time will be in step 2, so I'd pick the alternative that gives the best execution. A general rule for enabling the DB engine to optimize is to give it set-based queries rather than looping in the client and issuing several small queries. Available indexes and client-server latency of course affect how large the difference is, but I think that your option #2, creating a prepared statement dynamically, is often the best alternative.

Have you done any tests of the different alternatives? If you have, what do they show?

Anders Abel
A: 

As others have stated, it depends on the number of parameters and the size of the data. From what you have stated in the comments, the source table could be something that has hundreds of thousands of rows. If that's the case, the question comes down to the number of allowed filtering inputs. Is your query only going to allow for a small set of inputs, or does it need to allow filtering on a thousand countries? If the latter, then I'd recommend storing the selections in an intermediate table and joining off that. Something like:

Create Table CriteriaSelections
(
    SessionOrUsername nvarchar(50) not null
    , Country nvarchar(50) not null
)

On selection, you would populate this table and then query from it like so:

Select ...
From BigFatCountryTable
    Join CriteriaSelections
        On CriteriaSelections.Country = BigFatCountryTable.Country
            And CriteriaSelections.SessionOrUsername = @SessionOrUsername

You can use the RNGCryptoServiceProvider to generate a random number if this might be called multiple times in different ways by the same "session" in parallel. The catch to this setup is that you need to clear out the selections table periodically.
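
From the calling side (a JDBC sketch here), populating and querying that table might look like the following; the selected columns, the BigFatCountryTable layout, and the sessionKey handling are assumptions for illustration:

import java.sql.*;
import java.util.List;

public class SelectionTableExample {

    static void queryBySelections(Connection conn, String sessionKey, List<String> countries)
            throws SQLException {
        // Batch-insert this session's selections.
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO CriteriaSelections (SessionOrUsername, Country) VALUES (?, ?)")) {
            for (String country : countries) {
                ps.setString(1, sessionKey);
                ps.setString(2, country);
                ps.addBatch();
            }
            ps.executeBatch();
        }

        // Join against the selections for this session only.
        String select =
            "SELECT b.Country, b.Population " +
            "  FROM BigFatCountryTable b " +
            "  JOIN CriteriaSelections s " +
            "    ON s.Country = b.Country " +
            "   AND s.SessionOrUsername = ?";
        try (PreparedStatement ps = conn.prepareStatement(select)) {
            ps.setString(1, sessionKey);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // consume rows
                }
            }
        }

        // Clear this session's rows so the selections table does not grow unbounded.
        try (PreparedStatement ps = conn.prepareStatement(
                "DELETE FROM CriteriaSelections WHERE SessionOrUsername = ?")) {
            ps.setString(1, sessionKey);
            ps.executeUpdate();
        }
    }
}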

If the entities in question are somewhat immutable (e.g. a Country, a City etc.) then using a caching strategy in conjunction with your querying strategy would also help.

BTW, another solution along the same lines is to use a temp table. However, if you do that you need to be careful to use the exact same connection for creation of the temp table, the population of the temp table and its use.

Thomas
This sounds like a good suggestion; unfortunately, on the machine I am pulling the data from I have limited access. I definitely wouldn't be able to create a table, and I am unsure about using a temp table. Let's say I could create a temp table; how would you recommend getting the initial values into the table?
James McMahon
Getting the values into the temp table is easy. That's just doing a bunch of inserts like you would into a regular table. The tricky part is maintaining the same connection. So in one call you would do "Create Table #Foo...", then another call on the same connection would do a bunch of "Insert #Foo...", then lastly, on the same connection, you would run your select query where it would join to #Foo.
Thomas
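
For what it's worth, a JDBC sketch of that same-connection flow (the #Foo syntax follows the comment and is SQL-Server-style; DB2 and other engines declare temporary tables differently, and the column names are assumptions):

import java.sql.*;
import java.util.List;

public class TempTableExample {

    static void tempTableFlow(Connection conn, List<String> countries) throws SQLException {
        // 1. Create the temp table on this connection.
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE #Foo (Country nvarchar(50) not null)");
        }

        // 2. Populate it with a batch of inserts on the same connection.
        try (PreparedStatement ps = conn.prepareStatement("INSERT INTO #Foo (Country) VALUES (?)")) {
            for (String country : countries) {
                ps.setString(1, country);
                ps.addBatch();
            }
            ps.executeBatch();
        }

        // 3. Join against it, still on the same connection, or #Foo will not be visible.
        try (PreparedStatement ps = conn.prepareStatement(
                 "SELECT b.Country, b.Population FROM BigFatCountryTable b JOIN #Foo f ON f.Country = b.Country");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // consume rows
            }
        }
    }
}
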
Btw, another way of solving this problem is to pass a delimited list into a stored proc and then split that list into a temp table using a split function.
Thomas
A: 

Depending on the database engine used, there might be another alternative.

For MS SQL, for example, you could use a CSV->Table function such as: http://www.nigelrivett.net/SQLTsql/ParseCSVString.html

Then you can provide your query with a comma-separated string of values instead and join the table:

SELECT ..
FROM  table t
INNER JOIN dbo.fn_ParseCSVString(?, ',') x
     ON  x.s = t.id
WHERE ...

In this case, there will be two loops: building the CSV string (if you do not already have it in this format) and splitting the CSV into a table.
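
For example, from the client the first of those loops can be as simple as joining the values and binding the result as a single parameter; a sketch (the table name is hypothetical, and values containing commas are not escaped):

import java.sql.*;
import java.util.List;

public class CsvParameterExample {

    static void queryByCsv(Connection conn, List<String> ids) throws SQLException {
        // Build the comma-separated string of values.
        String csv = String.join(",", ids);
        String sql =
            "SELECT t.* " +
            "  FROM some_table t " +
            " INNER JOIN dbo.fn_ParseCSVString(?, ',') x ON x.s = t.id";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            // The whole CSV is bound as one parameter; the split happens in the database.
            ps.setString(1, csv);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // consume rows
                }
            }
        }
    }
}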

But it may provide better performance than executing the query several times, and better than using IN (which in my experience has pretty bad performance). If performance is really an issue, you should of course test.

Results may also vary depending on network overhead, etc...

Brimstedt
+2  A: 

I reached a point in my project where I was able to test with some real data.

Based on 1435 items, Option 1 takes ~8 minutes, Option 2 takes ~15 seconds, and Option 3 takes ~3 minutes.

Option 2 is the clear winner in terms of performance. It is a little harder to code around, but the performance difference is too great to ignore.

It makes sense that going back and forth to the database is the bottleneck, though I'm sure the results listed here would vary based on network, database engine, database machine specs, and other environmental factors.

James McMahon
Take into account that the number of elements in an IN clause is limited. In older versions of Oracle it was 255, if I remember well. Perhaps the same applies in other DBMSs.
Lluis Martinez
@Lluis Martinez, I was going to edit the answer to say the same thing. I agree that it would be important to keep in mind the limit on the number of parameters for a Prepared Statement, but on most modern databases I think the limit is fairly high. Just for reference, the database I am using in this particular project is DB2.
James McMahon