Background: I'm running multiple simuations on a set of data. For each session, I'm allocating projects to students. The difference between each session is that I'm randomising the order of the students such that all the students get a shot at being assigned a project they want. I was writing out some of the allocations in a spreadsheet (i.e. Excel) and it basically looked like this (tiny snapshot, actual table extends to a few thousand sessions, roughly 100 students).
| | Session 1 | Session 2 | Session 3 |
|----------|-----------|-----------|-----------|
|Stu1 |Proj_AA |Proj_AB |Proj_AB |
|----------|-----------|-----------|-----------|
|Stu2 |Proj_AB |Proj_AA |Proj_AC |
|----------|-----------|-----------|-----------|
|Stu3 |Proj_AC |Proj_AC |Proj_AA |
|----------|-----------|-----------|-----------|
Now, the code that deals with the allocation currently stores a session in an object. The next time the allocation is run, the object is over-written.
Thus what I'd really like to do is to store all the allocation results. This is important since I later need to derive from the data, information such as: which project Stu1
got assigned to the most or perhaps how popular Proj_AC
was (how many times it was assigned / number of sessions).
Question(s):
What methods can I possibly use to basically store such session information persistently? Basically, each session output needs to add itself to the repository after ending and before beginning the next allocation cycle.
One solution that was suggested by a friend was mapping these results to a relational database using SQLAlchemy
. I kind of like the idea since this does give me an opportunity to delve into databases.
Now the database structure I was recommended was:
|----------|-----------|-----------|
|Session |Student |Project |
|----------|-----------|-----------|
|1 |Stu1 |Proj_AA |
|----------|-----------|-----------|
|1 |Stu2 |Proj_AB |
|----------|-----------|-----------|
|1 |Stu3 |Proj_AC |
|----------|-----------|-----------|
|2 |Stu1 |Proj_AB |
|----------|-----------|-----------|
|2 |Stu2 |Proj_AA |
|----------|-----------|-----------|
|2 |Stu3 |Proj_AC |
|----------|-----------|-----------|
|3 |Stu1 |Proj_AB |
|----------|-----------|-----------|
|3 |Stu2 |Proj_AC |
|----------|-----------|-----------|
|3 |Stu3 |Proj_AA |
|----------|-----------|-----------|
Here, it was suggested that I make the Session
and Student
columns a composite key. That way I can access a specific record for a particular student for a particular session. Or I can merely get the entire allocation run for a particular session.
Questions:
Is the idea a good one?
How does one implement and query a composite key using SQLAlchemy
?
What happens to the database if a particular student is not assigned a project (happens if all projects that he wants are taken)? In the code, if a student is not assigned a project, instead of a proj_id
he simply gets None
for that field/object.
I apologise for asking multiple questions but since these are closely-related, I thought I'd ask them in the same space.
Edit - 25/03/2010
The class structure for Student currently stands like so:
class Student(DeptPerson):
def __init__(self, name, stud_id):
super(Student, self).__init__(name, stud_id)
self.preferences = collections.defaultdict(set)
self.allocated_project = None
self.allocated_rank = 0
(Misc.) It inherits from a class called DeptPerson
that just has name
and stud_id
.
Thus, as the allocation function goes about assigned projects (referenced by a unique ID - Project
is its own class) to a student (for a single run/trial/session), it will reflect the value in allocated_project
. If a student isn't assigned a project for a particular run (because someone else already has it, mwhahah...sorry), allocated_project
remains as None
(this is pretty useful information for me).