views: 117
answers: 6

Say tableA has 1 row to be returned but with 100 columns, while tableB has 100 rows to be returned but only one column from each. TableB has a foreign key referencing tableA.

Will a left join of tableA to tableB return 100*100 cells of data, whereas 2 separate queries would return 100 + 100 cells, i.e. 50 times less data? Or is that a misunderstanding of how joins work?

Is it ever more efficient to use many simple queries rather than fewer more complex ones?

A: 

I think that your question basically is about database normalization. In general, it is advisable to normalize a database into multiple tables (using primary and foreign keys) and to join them as needed upon queries. This is better for insert/update performance and for keeping the data consistent, and usually results in smaller database sizes as well.

As for the row numbers returned, only a cross join would actually return 100*100 rows; any inner or outer join will not create all combinations, but rather tie together rows on the given conditions, and for outer joins preserve rows which could not be matched. Wikipedia has some samples in its JOIN article.
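To make the row counts concrete, here is a small sketch using Python's stdlib sqlite3 as a stand-in database. The table and column names are made up for illustration, and the tables are kept tiny (3 and 4 rows) rather than 100 each:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tableA (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE tableB (id INTEGER PRIMARY KEY, "
            "a_id INTEGER REFERENCES tableA(id), val TEXT)")
cur.executemany("INSERT INTO tableA VALUES (?, ?)", [(1, "x"), (2, "y"), (3, "z")])
# all four tableB rows reference tableA row 1; rows 2 and 3 have no match
cur.executemany("INSERT INTO tableB VALUES (?, ?, ?)",
                [(10, 1, "a"), (11, 1, "b"), (12, 1, "c"), (13, 1, "d")])

# cross join: every combination of rows
cross = cur.execute("SELECT COUNT(*) FROM tableA CROSS JOIN tableB").fetchone()[0]
# left outer join: one row per match, plus one row per unmatched left row
left = cur.execute(
    "SELECT COUNT(*) FROM tableA LEFT JOIN tableB ON tableB.a_id = tableA.id"
).fetchone()[0]
print(cross)  # 3 * 4 = 12
print(left)   # 4 matches for row 1, plus rows 2 and 3 preserved with NULLs = 6
```

Only the cross join multiplies the row counts; the left join produces one row per matching pair plus the unmatched left rows.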

For very query-intensive applications, performance may be better with less normalized tables. However, as always with optimizations, I'd only consider going in that direction after seeing real, measurable problems (e.g. with a profiling tool).

In general, try to keep the number of roundtrips to the database low; a large number of single simple queries will suffer from the overhead of talking to the DB engine (network etc.). If you need to execute complex series of statements, consider using stored procedures.

Lucero
He's not talking about 100*100 rows, he's talking about 100 rows of 100 columns (100*100 cells). Compared to separately retrieving the 1 row of 100 columns, and then the 100 rows of 1 column from the related table.
Kirk Broadhurst
@Lucero In that Wikipedia JOIN article it says (under Left outer join) that "the values in the left table will be repeated for each distinct row on the right table." That seems to suggest that the data from tableA is transferred multiple times.
kjack
@kjack, an outer join works the same way as an inner join (which may repeat rows - I'll come to that), except that rows which do not match the join predicate from the left, the right or both data sets are not discarded but kept. In any join, if a predicate matches several rows, the join is repeated for each matching row - that is the "duplicate" - but this is not limited to outer joins.
Lucero
+2  A: 

What is best at joining: the database engine or client code? In practice I use both techniques; it depends on the client and how the data will be used.

  • Where the data requires some processing to, say, render on a web page, I'd probably split header and detail recordsets. We use this approach because we have some business logic between the DB and the HTML

  • Where it's consumed simply and linearly, I'd join in the database to avoid unnecessary processing; for example, simple reports or exports

gbn
Also the KISS principle, I find it easier to write short queries!
kjack
+1  A: 

It depends. If you only take into account SQL efficiency, obviously several simpler queries with smaller results will be more efficient. But you need to take the whole process into account: if the join would otherwise be done on the client, or you need to filter results after the join, then the DBMS will probably be more efficient than doing it in your code.

Coding is always a tradeoff between different systems (DB vs. client, RAM vs. CPU...); you need to be conscious of this and try to find the best balance.

In this case 2 queries probably outperform 1, but that is not a general rule.
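The cell-count arithmetic from the question can be sketched with sqlite3; the table and column names are invented, and 5 columns stand in for the question's 100:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tableA (id INTEGER PRIMARY KEY, "
            "c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT)")
cur.execute("CREATE TABLE tableB (id INTEGER PRIMARY KEY, a_id INTEGER, val TEXT)")
cur.execute("INSERT INTO tableA VALUES (1, 'a', 'b', 'c', 'd')")
cur.executemany("INSERT INTO tableB VALUES (?, 1, ?)",
                [(i, str(i)) for i in range(10)])

# one join: tableA's columns are repeated on every matching tableB row
joined = cur.execute(
    "SELECT a.*, b.val FROM tableA a JOIN tableB b ON b.a_id = a.id"
).fetchall()
join_cells = sum(len(row) for row in joined)  # 10 rows * 6 columns = 60

# two queries: each table's data crosses the wire only once
a_row = cur.execute("SELECT * FROM tableA WHERE id = 1").fetchone()
b_vals = cur.execute("SELECT val FROM tableB WHERE a_id = 1").fetchall()
two_query_cells = len(a_row) + sum(len(r) for r in b_vals)  # 5 + 10 = 15
print(join_cells, two_query_cells)
```

So the join does transfer the wide row's cells once per matching child row, while the two-query approach transfers each cell only once, at the cost of an extra round trip.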

frisco
+4  A: 

First and foremost, I would question a table with 100 columns and suggest that there is possibly a better design for your schema. In the real world, this number of columns is less common, so typically the difference in the amount of data returned with one query vs. two becomes less significant. 100 columns in a table is not necessarily bad, just a flag that the design should be reconsidered.

However, assuming your numbers are what they are to make clear the question, there are a few important variables to consider:

1 - What is the speed of the link between the db server and the application server? If it is very slow, then you are probably better off minimizing the amount of data returned vs. the number of queries you run. If it is not slow, then you will likely expend more time in the execution of two queries than you would returning the increased payload. Which is better can only be determined by testing in your own environment.

2 - How efficient is the transport protocol itself? Perhaps there is some kind of compression of the data, or an even more clever algorithm that knows columns 2 through 101 are duplicated for every row, so it only passes them once. Strategies like this in the transport protocol would mitigate any of your concerns. Again, this is why you need to test in your own environment to know for sure.

As others have pointed out, you also need to consider what will be done with the data once you get it (e.g., JOINs, GROUPing, etc), but I am limiting my response to the specifics of your question around query count vs. payload size.

RedFilter
+1 The speed is the most important thing. A SQL join will be quicker than piecing the data together in your application, but you need to consider if you're going to lose all that and more by transporting much more data. And yes, 100 columns is way too many.
Kirk Broadhurst
@RedFilter Clever transport protocols excepted, the consensus amongst the answers seems to be that the data in tableA will be transferred 100 times.
kjack
As you guessed, I picked 100 rows for illustrative purposes, but sometimes there are joins upon joins and then outer joins in the one query, which could have a similar effect. Especially when people select all the columns rather than specific ones.
kjack
+1  A: 

Generally fewer queries makes for better performance, as long as the queries return data that is actually related. There is no point in trying to put unrelated data into the same query just to reduce the number of queries.

There are of course exceptions, and your example may be one of them. However, it depends on more than the number of fields returned, such as what the fields actually contain, i.e. the actual amount of data.

As an example of how the number of queries affects performance, I can mention a solution that I have (sadly enough) seen many times. In that solution the programmer would first get a number of records from one table, then loop through the records and run another query for each record to get the related records from another table. This clearly results in a lot of queries, and a solution having either one or two queries would be much more efficient.
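That query-per-record anti-pattern (often called "N+1 queries") can be sketched with sqlite3; the orders/items schema here is hypothetical, chosen only to illustrate the shape of the problem:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT)")
cur.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(100)])
cur.executemany("INSERT INTO items VALUES (?, ?, ?)",
                [(i, i % 100, "sku-%d" % i) for i in range(300)])

# anti-pattern: one query per parent record -> 1 + N round trips
queries = 0
orders = cur.execute("SELECT id FROM orders").fetchall()
queries += 1
for (order_id,) in orders:
    cur.execute("SELECT sku FROM items WHERE order_id = ?", (order_id,)).fetchall()
    queries += 1
print(queries)  # 101 queries for 100 orders

# better: a single join returns the same data in one round trip
rows = cur.execute(
    "SELECT o.id, i.sku FROM orders o JOIN items i ON i.order_id = o.id"
).fetchall()
print(len(rows))  # 300
```

Against an in-memory database the loop is harmless, but over a network each of those 101 queries pays a round-trip cost, which is exactly the overhead the answer describes.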

Guffa
+1  A: 

“Is it ever more efficient to use many simple queries rather than fewer more complex ones?”

The query that traverses the least amount of data and returns no more than you need is the more efficient one. Beyond this, there can be RDBMS-specific conditions that are more efficient on one RDBMS than another. At a very low level, when you deal with less data your results can be retrieved much quicker, so efficient queries are those that work with only the least amount of data needed to get the result you are looking for.
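One simple application of that principle is selecting only the columns you need instead of `SELECT *`; a minimal sqlite3 sketch with an invented `users` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, "
            "name TEXT, email TEXT, bio TEXT)")
cur.execute("INSERT INTO users VALUES (1, 'Ann', 'ann@example.com', 'a long bio...')")

# SELECT * drags every column across the wire
wide = cur.execute("SELECT * FROM users WHERE id = 1").fetchone()
# naming only the needed column keeps the payload minimal
narrow = cur.execute("SELECT name FROM users WHERE id = 1").fetchone()
print(len(wide), len(narrow))  # 4 cells vs. 1 cell
```

With 100-column tables like the one in the question, the difference between `*` and an explicit column list is correspondingly larger.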

Vijay Selvaraj