There's lots that can be taught about transactions in databases (and I'm really talking about relational databases here, since that's what I'm currently teaching), but I believe it's easy to teach the wrong things. Given that students are very likely going to forget most of what you say before they leave the lecture room, where should the focus go?

Some key topics that often feature in textbooks and lecture slides (going by Ramakrishnan and Gehrke as a main guide) are listed below, but don't feel restricted by them:

  • ACID properties
  • schedules and their serializability
  • conflict serializability and precedence graphs
  • SQL transaction statements (COMMIT, ROLLBACK, SAVEPOINT)
  • locking protocols (Strict 2PL and such)
  • isolation levels (READ COMMITTED, REPEATABLE READ, etc.)
  • logging and recovery (WAL, ARIES)
  • transaction support in APIs (JDBC, PHP)

I'm after an informed view of which topics are most worthwhile and which aren't. I'm also interested in where you're coming from in answering. Have you found some nugget of insight invaluable in a job, or are you frustrated that newly graduated employees don't know some basic detail?

Background: I'm teaching the topic in an introductory DBMS course to students of a 1-year IT Masters course. The students are computer-literate (mostly CS) graduates.

+3  A: 

First of all: is your focus using transactions or implementing them? What is the background of the students - particularly, do they understand concurrency?

My opinion as a non-DB expert who nevertheless needs to use them all the time:

  • I think the actual SQL statements are just a detail; they are worth including to make them less menacing in practice, but let them look up the syntax themselves. You could intentionally use a variety of syntaxes (e.g. via APIs) simply to get across the point that it's not the name that matters, but the idea and API contract a transaction represents.
  • Understanding isolation levels is required, at least up to a point; students should understand that life isn't always SERIALIZABLE, why one would (almost always) choose that handicap, and what kind of problems to watch out for. Oh, and the various SNAPSHOT varieties should be included (I'm mentioning this explicitly since these in particular don't always show up in lectures, yet are very widely used).
  • Locking (or the lack thereof) is obviously related to the isolation levels and their advantages and disadvantages. In particular, minor seemingly irrelevant changes to a transaction can affect the order in which locks are acquired and thus the presence and absence of deadlocks. In that respect it may be important; but otherwise it's an implementation detail (from my perspective) preferably ignored.
  • In terms of ACID: an understanding that it's not an all-or-nothing affair, and that it can be quite useful to give up parts in return for other benefits (particularly performance).
  • Perhaps it's neat to point out that lots of apps use databases, even desktop apps, partially because the simple, robust guarantees of a DB make persistence easier than raw I/O (e.g. SQLite in Firefox or Adobe Reader). Anybody doing programming is likely to occasionally touch a DB; there's almost no avoiding it, and you wouldn't want to, since the atomicity and consistency guarantees are valuable even in a non-concurrent context (if your PC crashes, your Lightroom DB isn't corrupted).
  • Also, it should be clear that in a DB (in general) all statements are executed in an implicit transaction; you're not avoiding transactions by avoiding the explicit transaction statements, merely using more, smaller transactions, which can still deadlock and cause locking or fsync overhead. In other words: if the students want to work with a DB (which they almost inevitably will have to, and will want to), they simply cannot avoid transactions.
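
A quick sketch of that last point, using Python's sqlite3 module (an illustration of my own; the accounts table and the transfer are made up): in autocommit mode every statement runs in its own implicit transaction, so a failure between two related updates leaves the data half-changed.

```python
import sqlite3

# Autocommit connection: every statement runs in its own implicit transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")

try:
    # No BEGIN here: this debit commits immediately as its own transaction...
    conn.execute("UPDATE accounts SET balance = balance - 60 WHERE name = 'alice'")
    raise RuntimeError("simulated crash mid-transfer")
    # ...so the matching credit below never runs.
    conn.execute("UPDATE accounts SET balance = balance + 60 WHERE name = 'bob'")
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 40, 'bob': 0}: 60 units have simply vanished
```

The point for students: they didn't "avoid" transactions here; they just ran two small ones where one big one was needed.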

Background: I write webapps at work and other stuff for fun.

Eamon Nerbonne
+1: All excellent points. I've inherited lecture material that seems to lean too far to the academic (identifying non-serializable schedules with precedence graphs) rather than the practical. I doubt the students know much about concurrency, so that's certainly also going to feature strongly.
beldaz
+1  A: 

One thing I would always mention in the context of "transactions", is that the concept is not at all limited to databases.

Take a printer resource that is included in a transaction in which "confirmation sheets" are printed, and consider how the ACID properties of such a transaction imply:

  • A: Either the database is updated AND a confirmation sheet is printed, or no confirmation sheet is printed AND the database is not updated.
  • C: No two confirmation sheets from distinct transactions get mingled up.
  • I: At end-of-transaction, the paper feed is always positioned at top-of-page.
  • D: At end-of-transaction, the confirmation sheet is physically available; no paper jams or such have occurred.

Also, I've not seen any mention of things such as two-phase commit. Perhaps you see this as part of the "and such", but I think it's worth mentioning as a topic in its own right.
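
To make the two-phase commit idea concrete, here is a toy in-process sketch (the `Participant` class and its vote flag are entirely hypothetical, not a real distributed implementation): the coordinator first asks every participant to prepare and vote, and commits only if all vote yes; a printer voting "no" after a paper jam aborts the database update too.

```python
# Toy two-phase commit: Participant and two_phase_commit are illustrative names.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: the participant votes on whether it can commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        # Phase 2 (commit path): make the prepared work permanent.
        self.state = "committed"

    def rollback(self):
        # Phase 2 (abort path): undo any prepared work.
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect every vote; a single "no" dooms the whole transaction.
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:  # Phase 2: all voted yes, so all commit.
            p.commit()
        return True
    for p in participants:      # At least one "no": everyone rolls back.
        p.rollback()
    return False

db = Participant("database")
printer = Participant("printer", can_commit=False)  # e.g. a paper jam: votes "no"
ok = two_phase_commit([db, printer])
print(ok, db.state, printer.state)  # False aborted aborted
```

The sketch glosses over everything that makes real 2PC hard (logging the votes durably, coordinator failure, blocking), but it shows the shape of the protocol.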

Erwin Smout
+1 for two-phase commit, but I'd like to hear more about why you think it's a good thing to concentrate on.
beldaz
Also, aren't you the same Erwin Smout as http://stackoverflow.com/users/313840/erwin-smout ? How come the different user ID?
beldaz
+1  A: 

See, to my mind that is exactly what not to teach; I could hear the students snoring as I read it.

What is important about transactions: data integrity. Show how to set up an explicit transaction for multiple statements, and write the same statements without an explicit transaction. Show them how very messed up the data can get when you don't use explicit transactions. In the process of your example, cover things like ACID, but in the context of an actual example of how to do it so it works and, more importantly, what happens when you do it wrong.
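
A minimal sketch of that kind of demonstration, using Python's sqlite3 module (the account-transfer example is made up): with an explicit transaction, a failure partway through can be rolled back, so the half-done debit disappears along with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")

# A failing transfer, wrapped in an explicit transaction this time.
try:
    conn.execute("BEGIN")
    conn.execute("UPDATE accounts SET balance = balance - 60 WHERE name = 'alice'")
    raise RuntimeError("simulated failure mid-transfer")
    conn.execute("UPDATE accounts SET balance = balance + 60 WHERE name = 'bob'")
    conn.execute("COMMIT")
except RuntimeError:
    conn.execute("ROLLBACK")  # the half-done debit is undone too

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}: no money lost
```

Running the same statements without the BEGIN/ROLLBACK, and letting students query the mangled result, is the "what happens when you do it wrong" half of the lesson.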

Then get into how transactions from different users can conflict with each other and why that becomes a problem. Talk about how people often solve locking problems by using dirty reads and why that is a bad thing to do. Use actual data examples, not just boring lectures. Make them work to get the answers instead of telling them.
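
One way to demonstrate two users conflicting, sketched with Python's sqlite3 and two connections to the same file (the connection names and table are made up; SQLite locks the whole database, so it shows the conflict in its crudest form):

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file stand in for two users.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
user1 = sqlite3.connect(path, isolation_level=None)
user1.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)")
user1.execute("INSERT INTO t VALUES (1, 0)")

# timeout=0 makes the conflict fail fast instead of silently waiting.
user2 = sqlite3.connect(path, isolation_level=None, timeout=0)

user1.execute("BEGIN IMMEDIATE")  # user 1 takes the write lock up front
user1.execute("UPDATE t SET v = 1 WHERE id = 1")

error = None
try:
    user2.execute("UPDATE t SET v = 2 WHERE id = 1")  # user 2 conflicts
except sqlite3.OperationalError as e:
    error = str(e)
print(error)  # database is locked

user1.execute("COMMIT")  # once user 1 finishes, user 2 can get the lock
user2.execute("UPDATE t SET v = 2 WHERE id = 1")
final = user1.execute("SELECT v FROM t WHERE id = 1").fetchone()[0]
print(final)  # 2
```

Students can then be asked what a server that *doesn't* fail fast does instead (blocks, or reads dirty data), which leads naturally into isolation levels.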

But always go back to the data in the database and its integrity. Teach with actual examples, not lecture notes and PowerPoint, and have the students tell you what the problem was with the bad examples after they query the data, rather than telling them (in my experience, they learn more when you force them to figure things out).

BTW, you aren't teaching this before query skills, are you? No lecture on the technicals of database operation (including normalization) should be taught before query skills, because until students can query the database they aren't equipped to understand anything else, and if you start with something esoteric they will stop listening and not hear a thing all semester. They need query skills more than any other single thing, and those should come in the first lecture before anything else. Sadly, the textbooks are not written this way, and this is to my mind (and yes, I've taught databases) the single most important factor in why students get out of school understanding nothing about databases.

Sorry about the lecture, you hit one of my nerves with your question.

HLGEM
Hmm, I evidently did hit a nerve, but I don't think my question is in conflict with anything you've written. Clearly you have strong feelings about how it should be taught, but I've asked about what should be taught. You mention focusing on data integrity, covering ACID properties and conflict anomalies, which I quite agree with. (And yes, querying was covered a while back).
beldaz