ansaurus

Question

What is the best way to optimize schema for capturing attendance data

Answer 1

A:

IMHO, having a single row per user per month with a lot of concatenated characters isn't going to be any better than having lots of rows with a single character on it, especially if you're going to have to split that string everytime you want to display the data on another application.

If you just want to figure out the number of days a user attended your camp, why not create a table specifically for that? Everytime you logged a user's attendance, you would only have to update that table by increasing the number of days that the user had attended. As such, this value would not be calculated on-the-fly and it shouldn't give you any performance issues.

So, my advice would consist in two tables:

id | user_id | date | present

and

user_id | month | attendance

You should have some indexes on the user_id field as well, in order to increase the performance of the system.

Cheers

Hal 2010-07-07 09:03:04

@Hal - Assuming that the secondary table would be used to increment or decrement whenever is marked. So you are advising against the count(*) mechanism to arrive at the total number of days when a player was present. Am I correct?

Samuel 2010-07-07 09:36:06

Yup, mainly because it can cause performance issues. But if you have a fast server or aren't expecting a large number of users then the count mechanism is fine. Either way, forget the concatenation.

Hal 2010-07-07 10:05:37

Answer 2

+2 A:

You should ask yourself why you would do that.

There are some possibilities, but it is likely that your database schema won't be fully normalized.

So first of all: what do you want to achieve and what are the reasons for that?

Some possibilities:

Some DBMS provide the ability to create a user-defined type
You could use a bitwise approach (in mysql the easiest way for this is using the SET datatype)

But again: what is your current problem, since finding out the number of days someone was present is nothing more than joining the appropriate tables, and aggregate with a count function

davyM 2010-07-07 09:15:46

I agree, one table is enough. Database server are fast , so there is no need to denormalize.

mb14 2010-07-07 09:22:35

@mb14 But when there is big amount of data, It will be beneficial to normalize data.

Himadri 2010-07-07 09:28:33

I wasn't sure if we needed many rows to capture the attendance information (If you look at a excel sheet with such information it's like a grid and you have a set of values for the entire month against a user, all days in a month). I was trying to model that. I do agree with the fact that it would be painful to mark presence or absence of a player within that comma separated list

Samuel 2010-07-07 09:32:45

@Himadari - What would you suggest if the number of rows in these tables are going to be on the high side. What kind of roll over mechanisms can we build, so that older data (say older than a year) or so gets stored in a different way. Any thoughts

Samuel 2010-07-07 09:34:31

@Samuel I don't know about some different method, only that you can take a backup of older data and remove from current database. But I was telling about normalized table.

Himadri 2010-07-07 09:52:53

@himadri, what do you mean by a big amount of data ? I think anyway your design should be denormalize so one table. You create a view or whatever, to calculate what you need. When you have performance issues (in a couple of years) then you just need to create a table corresponding to the view and store the results of the view in it. You can still purge as well the current table to have only the current year for example (and archive everything in another one) , which will prevent the working table to be to big.

mb14 2010-07-07 11:02:34

@mb14 If only one table is enough then why there is a concept of normalization? I believe it is much easy to maintain the database if it is enough normalized. It's not only about to big table there are many issues with database design and normalization.

Himadri 2010-07-07 11:20:21

@himadri. The thing the way of normalizing your problem is ONE table . Having two would be DEnormalizing (which is the opposite of normalising). As you said , normalising is goog but denormalising is bad. Normalising is removing redundant information . In your case the second table can be computed from the first one, so normalizing imply removing the second table. You only need to know who attend which days (first table) and from that you can do everything you want, attendance per week, per month, per year etc.

mb14 2010-07-07 12:29:00

@Samuel: please explain clearly the problems you want to solve or concerns you have. If we know what you want to achieve and why, we can better respond to your question.

davyM 2010-07-07 12:34:24

@mb14 you are confusing me.. have a look at http://databases.about.com/od/specificproducts/a/normalization.htm

Himadri 2010-07-08 04:13:37

Answer 3

+1 A:

AttMst
  id | date

AttDet
  attdetid | id | userid

In this way you need to store day in AttMst and the present users on that day will be stored in AttDet.

Himadri 2010-07-07 09:18:31

am veering towards this model. how would you query all user attendance records for a month? How would you do a join with the AttMst table?

Samuel 2010-07-07 09:24:03

`select date,userid from AttMst am,AttDet ad where am.id = ad.id group by date` This will show datewise present users. But it is untested query please check it.

Himadri 2010-07-07 09:39:42

Answer 4

+2 A:

You use the word "optimize" in the question title without explaining what it is you want to optimize.

If you're talking about query performance then you do not have a problem. The number of records you can have is governed by the number of sessions you have each day (because only one team can attend any given session). If you run ten sessions a day that's three hundred records per month. If you run one hundred sessions a day that is three thousand records a month. These are not big volumes of data. So you are making a bad decision by skewing your database design to avoid a performance problem which isn't there.

You mentioned spreadsheets in one of your comments. That is not a bad design to have. Along the top row there are sessions, down the side there are teams, and the cells show whether a team was present at a session. Those map to three database tables: SESSIONS, TEAMS and the intersection table TEAM_SESSIONS. You only need a record in TEAM_SESSIONS when a team attended a session.

As a proof of concept I knocked up three tables in Oracle.

SQL> desc teams
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ID                                        NOT NULL NUMBER
 NAME                                               VARCHAR2(20 CHAR)

SQL> desc sessions
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ID                                        NOT NULL NUMBER
 SSN_DAY                                            DATE
 SSN_START                                          NUMBER(4,2)
 SSN_END                                            NUMBER(4,2)

SQL> desc team_sessions
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 TEAM_ID                                   NOT NULL NUMBER
 SESSION_ID                                NOT NULL NUMBER

SQL>

The PIVOT function introduced in Oracle 11g makes it a cinch to knock up a matrix (different flavours of DBMS will have different ways to approach this). As you can see, three teams have booked sessions today, nobody wants to train at lunchtime, and Bec United are keen as mustard (or need the training)!

SQL> select * from (
  2      select t.name as team_name
  3             , trim(to_char(s.ssn_start))||'-'||trim(to_char(s.ssn_end)) as ssn
  4             , case when ts.team_id is not null then 1 else 0 end as present
  5      from   sessions s
  6             cross join teams t
  7             left outer join team_sessions ts
  8                  on (ts.team_id = t.id
  9                      and ts.session_id = s.id )
 10      where s.ssn_day = trunc(sysdate)
 11      )
 12  pivot
 13      ( sum (present)
 14        for ssn in ( '9-11', '11-13', '13-15', '15-17', '17-19')
 15      )
 16  order by team_name
 17  /

TEAM_NAME                '9-11'    '11-13'    '13-15'    '15-17'    '17-19'
-------------------- ---------- ---------- ---------- ---------- ----------
Balham Blazers                0          1          0          0          0
Bec United                    1          0          0          0          1
Dinamo Tooting                0          0          0          0          0
Melchester Rovers             0          0          0          1          0

SQL>

Anyway, the virtue of this data model is that it is flexible. We can count how often a team attends, what times they attend, what day of the week they attend, what sessions are always booked, what sessions are rarely booked, etc. Plus it is easy to manage the data. In particular, the advantage of the three table solution over just two tables is that it is easier to prevent double bookings and non-standard or overlapping time slots.

You see, normalisation isn't just some moon language we use to bamboozle the innocent, it offers real practical benefits. There are few scenarios where driving down to at least BCNF is not the best idea.

APC 2010-07-07 12:39:38

ansaurus

tags:

views:

answers:

What is the best way to optimize schema for capturing attendance data

related questions