views:

56

answers:

3

Let me illustrate this question with a simplified example. Assume I am building a project using python with a PostgreSQL relational database. In my database I have two tables "parent" and "child" which are N to M related through the table "parent_child". I want to be able to retrieve some data about a specific child owned by a specific parent in a secure way, which the following query lets me do (X, Y, and Z are literals supplied by the user):

SELECT child.age FROM parent, parent_child, child 
WHERE child.id = parent_child.child_id AND parent_child.id = X 
AND parent_child.parent_id = parent.id AND parent.id = Y 
AND parent.password = Z; 

Say a user comes along and types in the wrong value for either X, Y, or Z, the query would return an empty set, which could be detected and a message passed on to the user that there was an error. The problem of course is that I am unable to determine which value is causing problems and hence can not supply the user with specific information about what they mis-entered?

The simplest solution to this is to break up the query into several parts. First, verifying that the parent.id exists.

SELECT parent.id FROM parent WHERE parent.id = Y;

Second, checking if the password is correct.

SELECT parent.id FROM parent WHERE parent.id = Y and parent.password = Z;

Third, checking if the child exists.

SELECT child.id FROM child WHERE child.id = X;

Fourth, checking that the child is owned by the parent and returning the information we need.

SELECT child.age FROM child, parent_child WHERE parent_child.child_id = child.id AND parent_child.parent_id = Y AND parent_child.child_id = X;

These four queries will allow us to check specific things about the user supplied information and report specific problems as they occur. Obviously there is a lot of additional overhead in four queries verses a single query and I find four queries less readable than a single one. So is there anyway to have the best of both worlds? A single query and detailed error messages?

+1  A: 

Well, the problem here lies in the fact that the query isn't actually erroring -- it's giving you the correct information for your criteria each time. So there really isn't a way to know without examining each query individually.

You could perhaps check to see if you got no rows, and THEN run your other queries to find out why, and that would cut your overhead.

Dave Markle
+3  A: 
SELECT  p.id, p2.z AS pw, pc.parent_id, CASE p2.z WHEN p.pw THEN c.age END AS age
FROM    (VALUES (1)) AS p1(y)
LEFT JOIN
        parent p
ON      p.id = p1.y
LEFT JOIN
        (VALUES ('pw1')) AS p2(z)
ON      p2.z = p.pw
CROSS JOIN
        (VALUES(1)) AS p3(x)
LEFT JOIN
        child c
ON      c.id = p3.x
LEFT JOIN
        parent_child pc
ON      pc.parent_id = p.id
        AND pc.child_id = c.id

NULLs in appropriate columns would mean that the appropriate conditions failed.

Quassnoi
This approach does increase the resolution of the possible problems caused by user supplied data but this increase in detail is not nearly as high as the four queries posted in the question. So while this is a good answer, I am still curious if there is anything better.
Nixuz
@Nixuz: see the post update. Note that the values of the parameters are inside the `VALUES` clauses.
Quassnoi
Again excellent answer, but some trade-offs are made. This answer is good as it allows errors to be reported in pretty much the same level of detail as the four queries solution. However, this solution is much more complex than the previous solution, which makes me wonder if the performance is going to suffer. Likewise the complexity reduces the readability. Finally, this solution seems less secure than the other solutions, as even if a wrong password used the child's age is still reported. Nonetheless, I appreciated the work put into this answer and will be utilizing it in my project. Thanks.
Nixuz
@Nixuz: performance will be just the same, as on each step the same check is made, the only difference being returning a `NULL` value instead of filtering the record out. As for readability — well, it was you who wanted one query, and it's quite simple in fact.
Quassnoi
A: 

These four queries will allow us to check specific things about the user supplied information and report specific problems as they occur.

Yes, that is the standard procedure (and it is there for a reason. Let's say you were updating rows: you would have used up all sorts of server resources, eg the transaction log, only to find that it failed, and roll the whole thing back. Quite avoidable. Always check each level, before attempting the next level. Never lock up or update anything until you have performed full verification. Never attempt anything unless you are sure that it will succeed. In this case, you are not updating, but the standard allows you to isolate the error, in the usual way, at the earliest moment, and avoid wasting resources (at later levels due to the earlier failure).

Obviously there is a lot of additional overhead in four queries verses a single query

I do not understand your arithmetic. Let's say each query to a table by PK costs 50 resource units if it is not in the data cache, 2 units if it is. Assuming PostgreSQL has a data cache, and a multi-threaded engine, and your code segment is a contiguous sequence (stored proc or not):

  • first statement = 50
  • second statement (since the page is in cache) = 2
  • third statement = 50
  • fourth statement (since parent and child are in cache) = 2 + 2 + 50
  • equals 156 units

  • more important, in the case of error, the cost (depending on where the error is located) is 50 or 52 or 102 units

  • whereas the standalone fourth statement costs 150 units

I find four queries less readable than a single one.

Put some white space and commentary in-between if you need to improve readability. (Your code is hard for others to read anyway; I would format it.)

A single query and detailed error messages?

Well, you are getting detailed errors, no more no less; what you are asking for is isolation of the error to a particular point in your code (or the user request). If you were writing a stored proc for generic use, and returned an error code, the sequence I have identified would be demanded.

Any other method (and I am sure there are complex and devious methods) would be (a) even more overhead and (b) introduce unnecessary complexity into a simple pedestrian requirement, and therefore be hard to maintain.

PerformanceDBA