tags:

views:

158

answers:

5

[Book] isbn(PK),title,category_id,subcategory_id,price

[Author] isbn(FK),author_id(PK),name

[Category] category_id(PK),name

[SubCategory] sub_category_id(PK),category_id(FK),name

I have a database (not designed by me) that contains the above four tables.

I want to have a book list having the following format:

isbn, title, author name(s), category name, subcategory name(may not have), price

But there's some complexity, as you can see, each book can have more than one author, the author name column should have author names separated by commas.

And for the category which is the more difficult part, there are some categories that have no subcategories and thus, some book records have subcategory_id set to 0 because its category_id refers to an category that has no subcategories, in this case, the subcategory name column in the book list does not need to show anything.

I really have no idea how such a complex a complex SQL statement can be built quickly to get the book list. Would somebody kindly think of a solution?

Many thanks to you all.

A: 

Something like this should be close. If this is HomeWork, please add the HomeWork tag.

select
   Book.ISBN,
   Book.Title,
   Author.Name,
   Category.Name as Category_Name,
   SubCategory.Name as SubCategory_Name,
   Book.Price
from
   Book join Author
      on Book.ISBN = Author.ISBN
   join Category
      on Book.Category_ID = Category.Category_ID
   join SubCategory
      on Book.Category_ID = SubCategory.Category_ID
         and Book.SubCategory_ID = SubCategory.Sub_Category_ID
J.Hendrix
thank you very much for your reply, basically it should work but the problem of making author names separated by commas if there's more than one author for that book is not resolved in this sql statement.
bobo
+2  A: 

When you find yourself building an "extremely complex SQL statement", it's usually best to step back and rethink.

Remember this - the vast majority of operations performed on database table are selects, not inserts or updates (though there are exceptions to every rule, of course).

The right time to be "spending" CPU cycles calculating things like author lists is when the list changes, not when you just want to extract the information.

Add another column to the book table called author_list and then create an insert/update trigger on authors so that this column is rebuilt whenever an author is changed for the specific ISBN.

That puts the cost where it should be and will make your query a lot simpler. The trigger ensures the data stays consistent, and it's okay to break 3NF if you know what you're doing.

As to the subcategory, the case statement can be your friend, but per-row functions on select never scale well.

I would just create a set of rows in subcategories with the id of 0 (one for each category) and make its name blank. Then it can be done with a simple join without having to concern yourself with performance. This could also be don with a trigger on category so every category will always have a subcategory of 0.

With those two changes, the query becomes a lot less complex, something along the lines of:

select b.isbn, b.title, b.author_list, c.name, sc.name, b.price
from Book b, Category c, SubCategory sc
where b.category_id = c.category_id
and   b.category_id = sc.category_id
and   b.subcategory_id = sc.subcategory_id
order by ...

This query should scream along since it's using just the basic levels of relational algebra (i.e., no per-row functions (including case statements), no subqueries). And that's an "old-school" query, you may get even more performance by using explicit rather than implicit JOINs.

One final point: a properly 3NF schema would not have the ISBN in the authors table - a better option would be to have a separate BookAuthor table holding the ISBN and author_id to properly model the many-to-many relationship. But you may have aleready changed that for performance (I don't know).

paxdiablo
"Add another column to the book table called author_list and then create an insert/update trigger on authors so that this column is rebuilt whenever an author is changed for the specific ISBN." That's very dubious advice.
tpdi
I disagree with building the author name list initially using a trigger; I would not do this unless I found that doing it in a select statement (or more likely a UDF) was a big performance hit in testing. Maintaining a list like this means that redundant data is being stored in the database (the author name/book association).
RMorrisey
Not dubious at all. Disk space is cheaper than CPU grunt and selective reversion to 2NF is a well-known technique for performance.
paxdiablo
Yes, it's redundant but still *consistent*. It's a trade-off, but one that doesn't affect ACID.
paxdiablo
@Pax: The problem with the trigger idea is that you need to size the column to accomodate potentially larger amounts of information or you will truncate. That's a bit of guesswork. Also, that database is in the second normal form for authors, done in the third normal form it would have an authors table with IDs and then a book_authors table that linked the books and the authors together. You're actually recommending a variation on the first normal form.
John Cavan
@Pax: On the join front, I don't recommend putting false data into the database. You can achieve the desired outcome with a left join and, with a proper index, the expense would still be minimal. False data may need to be excluded with other queries and blank is not NULL.
John Cavan
On that last point, John, you may have missed my update. I *think* we agree on the book-authors bit. And you're right about the problem, though I believe it's a managable one. What we do (when we do it, which is generally only for big tables), is just set the column to N chars then run periodic checks to ensure there's been no overflow (db re-org and retrigger on the affected rows) if there has). Truncation (at least in this case) seems fine since it's for display only, any real data analysis would be done with the authors table, not the author_list filed.
paxdiablo
And, on the "false data" front, the question was seeking to avoid complex queries. Sentinel values in tables is another technique commonly used and it complicates other queries with a simple "where id <> 0". In terms of the question asked, I think that's preferable to a (potentially) monstrous query for this purpose. As I said, these are all trade-offs and you need to understand the costs of alternatives.
paxdiablo
Of course, these are all my *opinions* (though backed up with (depressingly) sizable experience). You're free to disagree, you wouldn't be the first :-)
paxdiablo
@Pax: I can see that point, but it strikes me as risky because there may be a period of time when the displayed data is incorrect. Now, I'll admit that this looks like a homework question, so it doesn't have a client screaming bloody murder because Joe Blow didn't see all the authors, but I'd hazard a guess that the Prof may not buy into the solution. Mind you, he may not buy into mine either, since I suggested doing it outside of SQL. :)
John Cavan
Well, I don't think avoiding complexity was the question, it was how to actually do such a complex thing in a SQL statement. Frankly, I don't think you can deal with a n-level author possibility in a single SQL statement and have it return one row without manipulating the data ahead of time as you suggested. My own preference would be to take that logic out of SQL.
John Cavan
Pax, you're adding unnecessary kludges (an author list column), and when that's brought to your attention, you're "fixing" it by adding more ad hoc kludges: triggers, chron jobs. It all adds up to a unmaintainable mess. The real fix is to normalize (as others have pointed out, table authors is a 2NF mess, as I pointed out, category/subcategory is too), and to make the author's list in an aggregate function (e.g, MySql's group_concat, or a user-defined one in, e.g., Oracle or postgres) or in the front-end that accesses it. I won't dispute your sizable experience, but....
tpdi
@tpdi, I'm not going to argue the point. It's only unmaintainable to those who don't know what they're doing. Denormalization is standard practice (and *well* understood) for performance reasons. Normalizing won't fix the problem of ungainly queries and, *as I've pointed out*, you have to understand the trade-offs. A single trigger maintains consistency and moves the performance hit where it should be, on data *update*, not *extract*. It probably won't matter with a small DB but we build for scalability and denormalization/sentinals are a vital part of that.
paxdiablo
It doesn't look like either of us is going to convince the other, so we should probably leave it at that. I'll let you have the last word if you wish, unless I consider it obviously wrong :-) Cheers.
paxdiablo
@pax in terms of performance/scalability, your solution to use a create/update trigger to make the author names available is really a nice one, especially when majority of db accesses are SELECT statements
bobo
Yes, just keep in mind the *potential* problems pointed out by others here. They're probably not showstoppers (in my opinion) but only you know for sure. They are (1) larger amount of storage needed, though I'm sure it's always better to sacrifice storage for performance; (2) possibility of loss of data in author list column, though this may not be important since it's probably display only and, if your trigger runs out of space in that column, it can always revert to tricks like "Kernighan, Ritchie et al" and you can flag that possible expansion of the column is needed (to be acted on later).
paxdiablo
A: 

See @Pax's answer for a nicer way to handle the null / zero values for sub_category_id

select isbn, a.name as author_name, c.name as category_name, sc.name as subcategory_name, price
from Book 
join Author a on isbn = a.isbn
join Category c on category_id = c.category_id
join SubCategory sc on category_id = sc.category_id and subcategory_id = sc.subcategory_id
where subcategory_id != 0
union
select isbn, a.name as author_name, c.name as category_name, '' as subcategory_name, price
from Book
join Author a on isbn = a.isbn
join Category c on category_id = c.category_id
join SubCategory sc on category_id = sc.category_id and subcategory_id = sc.subcategory_id
where subcategory_id = 0
Hamish Smith
thank you very much for your reply, basically it should work but the problem of making author names separated by commas if there's more than one author for that book is not resolved in this sql statement.apart from this, your solution of handling the subcategory=0 problem is simple and i really like it.
bobo
combining your sql statement with group_concat function, the book list is now available
bobo
yes, Pax's answer is a very useful one when we want to build an application that scales very well
bobo
A: 

Well, the subcategory business is poor database design. Even if you assume that a book can only be in one category, it's a poor design because (in that case), a category can always bee derived from the subcategory, so you've introduced redundancy by having book have attributes for both.

As far as the query you want, that's just a matter of doing the joins and projecting teh select statement. In you don't know enough SQL to do that, you probably shouldn't be trying to write queries (or you should be asking about basic joins and projections).

As to how you turn multiple rows into one (which is what you want to do with the authors), that depends on your RDBMS (which you don't specify) and/or your front-end.

tpdi
yes the db was poorly designed, thank you so much for your reply
bobo
+1  A: 

That's an odd schema, not how I would have designed it. Being denormalized, it's probably going to have a lot of duplication in the author table.

Anyways, because you may have one or more authors, joins aren't really going to cut it for that information. Some things, to be honest, are better done outside of SQL and this is one of them. You can just build a loop that constructs the information and emits the data when the ISBN changes, assuming you do your ordering well.

As for the categories and subcategories, use a left join and it will return NULL on the subcategory information which you can test for. If there's more than one subcategory possible for the book (or categories for that matter), then you're really DOA with SQL here.

John Cavan
yes, if the db doesn't have a handy group_concat like function provided, it'd be easier to do that in php
bobo