heya,

We have a web-based application, backed by a MySQL database.

One part of the system that we're coding requires us to store each user's attendance (i.e. yes/no) at sessions for each day of a week. For example, we'd need to store Monday through to Friday, then for each day, morning, lunch, afternoon, evening sessions etc. So essentially it's a 2-dim array.

I was wondering what's the cleanest way of storing this in the database?

At the moment, the person working on this seems to be leaning towards storing this as one int for each day, with 1's representing attendance and 0's representing not attending. I think what they mean to do is use a bitmask (e.g. 13 for 1101, so every session except afternoon). They're just storing it as actual 0's and 1's for some strange reason.
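To make the bitmask idea concrete, here is a minimal sketch in Python. The bit layout (morning = 8, lunch = 4, afternoon = 2, evening = 1) and the helper names `attended` and `mark` are my own assumptions, chosen only so that 13 (binary 1101) means "every session except afternoon" as in the example above.

```python
# Hypothetical bit layout: one bit per session, most significant bit first.
MORNING, LUNCH, AFTERNOON, EVENING = 8, 4, 2, 1  # 0b1000, 0b0100, 0b0010, 0b0001


def attended(mask: int, session_bit: int) -> bool:
    """Return True if the session's bit is set in the day's mask."""
    return (mask & session_bit) != 0


def mark(mask: int, session_bit: int, present: bool) -> int:
    """Return a new mask with the session's bit set (present) or cleared (absent)."""
    return (mask | session_bit) if present else (mask & ~session_bit)


monday = 13  # 0b1101: morning, lunch and evening, but not afternoon
print(format(monday, "04b"), attended(monday, AFTERNOON))
```

Every read and write has to go through operations like these, which is the maintenance cost I'm worried about.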

I thought it might be easier to store it as a list of bools (bits/tinyints), e.g. monday_morning, monday_lunch, monday_afternoon etc., as it's semantically more "correct" (I think?), it'll probably be easier to extend/maintain, and I also seem to be the only one on the team with any inkling of how to do bit-operations...lol.

Another way I was thinking was just to have a 1:1 table for each user, with a list of all the times they are attending, for example. Efficiency of this approach? (Not sure what sort of read/write patterns, but I'm guessing a fairly even spread of read/modifies).

What are some recommendations on this? Or are there better ways of storing this data?

Also, as a side-note, it probably will be boolean - it's doubtful we'll need to store more states than attending/not-attending in the table, and if we do, we are prepared to re-work the schema. Or do people suggest strongly going for ints over bits?

Cheers, Victor

A: 

Your second approach (the individual columns) is "more correct" in that it doesn't violate first normal form. The bitmask approach does, since you're storing more than one value in a single column (you're storing values for multiple sessions).

And don't store a bit internally. You aren't going to see any decrease in storage over, say, a tinyint (the engine isn't going to allocate exactly one bit for you, it will just restrict the acceptable values). You may as well use a tinyint and give yourself some breathing room.

Edit

As pointed out by Mark, if you have multiple bit columns it can pack them into a single byte, but worrying about whether the data takes up one byte or four is likely a premature optimization. The most normalized solution is the one suggested where you have an individual table that indicates which sessions the participant attended. If your sessions truly are fixed, then I would likely go with having separate columns for each session over either the bitmask or the fully normalized solution.

  1. The bitmask obfuscates the data and requires bitwise operations (obviously). These can be confusing in query syntax, since you're making multiple uses of the words "or" and "and". An ordinary index on the mask column also can't help with bitwise predicates, so finding all participants who attended, say, the morning or the morning and evening sessions will require a table scan every time.

  2. The fully normalized solution will complicate queries of the data. While it will support indexing, it will require a full join for every session type you want to check.

The one-column-per-session approach seems like the best solution. You're still only dealing with one row of data, but you can also query with meaningful syntax and take advantage of indexes.
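As a rough illustration of the one-column-per-session layout, here is a minimal sketch. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL so that it runs anywhere. The table name `attendance`, the index name, and the sample rows are assumptions for the example, and only the Monday columns are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here

# Hypothetical table: one row per user, one tinyint column per session.
conn.execute(
    """
    CREATE TABLE attendance (
        user_id           INTEGER PRIMARY KEY,
        monday_morning    TINYINT NOT NULL DEFAULT 0,
        monday_lunch      TINYINT NOT NULL DEFAULT 0,
        monday_afternoon  TINYINT NOT NULL DEFAULT 0,
        monday_evening    TINYINT NOT NULL DEFAULT 0
    )
    """
)
# A plain column can carry an ordinary index, unlike a bit inside a mask.
conn.execute("CREATE INDEX idx_monday_morning ON attendance (monday_morning)")

# Made-up sample data: (user_id, morning, lunch, afternoon, evening).
conn.executemany(
    "INSERT INTO attendance VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 0, 1), (2, 0, 1, 1, 1), (3, 1, 0, 0, 1)],
)

# The query names the sessions directly instead of using bitwise operators.
morning_and_evening = [
    row[0]
    for row in conn.execute(
        "SELECT user_id FROM attendance "
        "WHERE monday_morning = 1 AND monday_evening = 1 "
        "ORDER BY user_id"
    )
]
print(morning_and_evening)
```

Whether an index on a two-valued column actually gets used depends on the engine and the data distribution, so treat the index as an option rather than a guarantee.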

Adam Robinson
Sybase (and thus I suspect MS SQL Server) does store up to 8 bit columns in a byte, so you do get storage benefits.
Mark
heya, Yeah, it was my understanding that a lot of RDBMSs would pack consecutive bits together anyway. If so, are tinyints still the way to go? I'm not really a DB guy though. Would you recommend going for the bitmask approach though? I have less experience using them in databases - in code I'd use it if it suited the sorts of reads/writes I needed to run, so it would be useful here, from that respect, I suppose - however, is it really something done much in databases? Or are other approaches (e.g. like tvanfosson suggests above) more correct/efficient, and more common? Cheers, Victor
victorhooi
A: 

I would normalize it and have three tables: users, sessions, and sessions_attended. Users would contain information about the user, sessions would contain information about the session, and sessions_attended would be a join table indicating which sessions the user attended. Index your tables properly and the resulting joins should be pretty efficient.

 select u.name, s.name
 from users u join sessions_attended a on u.user_id = a.user_id
      join sessions s on s.session_id = a.session_id
 where s.course = ...some course id...
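For anyone who wants to try the three-table layout, here is a minimal runnable sketch. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL. The column definitions and sample rows are assumptions for illustration, and the course filter is left out because the sketch has no course column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here

# Hypothetical minimal schema for the three tables described above.
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sessions (session_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sessions_attended (
        user_id    INTEGER NOT NULL REFERENCES users (user_id),
        session_id INTEGER NOT NULL REFERENCES sessions (session_id),
        PRIMARY KEY (user_id, session_id)  -- one row per attended session
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO sessions VALUES (1, 'monday_morning'), (2, 'monday_lunch');
    INSERT INTO sessions_attended VALUES (1, 1), (1, 2), (2, 2);
    """
)

# Join through sessions_attended to list who attended what.
rows = conn.execute(
    """
    SELECT u.name, s.name
    FROM users u
    JOIN sessions_attended a ON u.user_id = a.user_id
    JOIN sessions s ON s.session_id = a.session_id
    ORDER BY u.name, s.name
    """
).fetchall()
print(rows)
```

A missing row means "did not attend", so adding a new session is an insert into sessions rather than a schema change.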
tvanfosson
heya, The only issue is, it's actually more likely that users will attend all sessions, or long consecutive stretches than here and there. Basically, it'll be like a five-day conference, and most people will go for all 5 days, or arrive a day late, or leave half-a-day early, for example. And we really only need to store yes/no to attendance. Would you still recommend we use the join table approach though? Thanks, Victor
victorhooi
My feeling is that you normalize until normalization becomes a problem.
tvanfosson