views:

169

answers:

3

Background: I'm running multiple simuations on a set of data. For each session, I'm allocating projects to students. The difference between each session is that I'm randomising the order of the students such that all the students get a shot at being assigned a project they want. I was writing out some of the allocations in a spreadsheet (i.e. Excel) and it basically looked like this (tiny snapshot, actual table extends to a few thousand sessions, roughly 100 students).

|          | Session 1 | Session 2 | Session 3 |
|----------|-----------|-----------|-----------|
|Stu1      |Proj_AA    |Proj_AB    |Proj_AB    |
|----------|-----------|-----------|-----------|
|Stu2      |Proj_AB    |Proj_AA    |Proj_AC    |
|----------|-----------|-----------|-----------|
|Stu3      |Proj_AC    |Proj_AC    |Proj_AA    |
|----------|-----------|-----------|-----------|

Now, the code that deals with the allocation currently stores a session in an object. The next time the allocation is run, the object is over-written.

Thus what I'd really like to do is to store all the allocation results. This is important since I later need to derive from the data, information such as: which project Stu1 got assigned to the most or perhaps how popular Proj_AC was (how many times it was assigned / number of sessions).

Question(s):
What methods can I possibly use to basically store such session information persistently? Basically, each session output needs to add itself to the repository after ending and before beginning the next allocation cycle.

One solution that was suggested by a friend was mapping these results to a relational database using SQLAlchemy. I kind of like the idea since this does give me an opportunity to delve into databases.

Now the database structure I was recommended was:

|----------|-----------|-----------|
|Session   |Student    |Project    |
|----------|-----------|-----------|
|1         |Stu1       |Proj_AA    |
|----------|-----------|-----------|
|1         |Stu2       |Proj_AB    |
|----------|-----------|-----------|
|1         |Stu3       |Proj_AC    |
|----------|-----------|-----------|
|2         |Stu1       |Proj_AB    |
|----------|-----------|-----------|
|2         |Stu2       |Proj_AA    |
|----------|-----------|-----------|
|2         |Stu3       |Proj_AC    |
|----------|-----------|-----------|
|3         |Stu1       |Proj_AB    |
|----------|-----------|-----------|
|3         |Stu2       |Proj_AC    |
|----------|-----------|-----------|
|3         |Stu3       |Proj_AA    |
|----------|-----------|-----------|

Here, it was suggested that I make the Session and Student columns a composite key. That way I can access a specific record for a particular student for a particular session. Or I can merely get the entire allocation run for a particular session.

Questions:
Is the idea a good one?
How does one implement and query a composite key using SQLAlchemy?
What happens to the database if a particular student is not assigned a project (happens if all projects that he wants are taken)? In the code, if a student is not assigned a project, instead of a proj_id he simply gets None for that field/object.


I apologise for asking multiple questions but since these are closely-related, I thought I'd ask them in the same space.


Edit - 25/03/2010

The class structure for Student currently stands like so:

class Student(DeptPerson):
    def __init__(self, name, stud_id):
        super(Student, self).__init__(name, stud_id)
        self.preferences = collections.defaultdict(set)
        self.allocated_project = None
        self.allocated_rank = 0

(Misc.) It inherits from a class called DeptPerson that just has name and stud_id.

Thus, as the allocation function goes about assigned projects (referenced by a unique ID - Project is its own class) to a student (for a single run/trial/session), it will reflect the value in allocated_project. If a student isn't assigned a project for a particular run (because someone else already has it, mwhahah...sorry), allocated_project remains as None (this is pretty useful information for me).

A: 

Can't help you on the db stuff, as I'm a total newb and only know enough to query data from sqlite tables...

For persistence, though, could you use the pickle module to store your objects? Check the docs for the exact usage, but I think it's pretty much file(filename, 'wb').write(pickle.pickle(myobject)) to write it and myobject = pickle.unpickle(file(filename, 'rb')) to read.

Then you can read multiple tables/whatever into multiple variables and do whatever comparisons you want.

If you don't need/want to read it back in via Python, you could also just manually format it as tab delimited or something and load that file into the spreadsheet app of your choice (OpenOffice Calc is fantastic).

dash-tom-bang
That isn't the correct API for pickle. (Also, pickle is very seldom the best serialization technique available for a given use case.)
Mike Graham
Also, don't use `file` to open files—use `open`, which the documentation will confirm is the right way. (Additionally, always use a context manager, i.e. `with open(filename, mode) as f:`)
Mike Graham
@Mike Graham: Is pickle worth using for this? How about `deepcopy`?
Az
@Az, It looks like a database will both persist your data and be a good tool for accessing and manipulating it. I cannot think of a task for which I think pickle is really the right tool. `copy.deepcopy` does not aid in serialization and will not help with this problem at all.
Mike Graham
@Mike Graham: Much appreciated! At least I know the direction I have to go down now and focus my energies trying to unravel `SQLAlchemy`.
Az
+3  A: 

A two-dimension display of data is a Relational database table with two keys.

In your example, the Student Key and the Session Key.

A "Composite key" is a piece of noise that you can ignore. It isn't helpful and isn't necessary. A composite key does not solve any problems well and create many difficulties. Pretend you never heard it.

What we do is introduce an additional column with a simple "identifier". It's an "autogenerated, surrogate key" for they row. Autogenerated unique keys for each row are a good thing. Composite keys are a bad thing.

S.Lott
@S.Lott: Gotcha. So the Relational Database would be a good way of doing it? I've been told that if I wanted to check how many times a certain project appears for a certain student, I'd just have to run a 'query'? Any other advice/answers for the questions above? Ta in advance.
Az
@Az: If you think "query" is a magical buzzword, you have much work to do. Find books. Read them.
S.Lott
@S.Lott: Is there any particular literature you can recommend for the context of what I'm doing? I'm not looking for anything particularly advanced mainly because I'm on a time leash :S
Az
@S.Lott: If I use an autogenerated key, I don't actually have to filter using that, right (it will number in the hundreds of thousands)?For example, I may want all the student-project pairs for a single session or all the student-project pairs for a particular student across multiple sessions, etc. This autogen key doesn't really affect it does it?Lastly, the `session_id` is linked with my Monte-Carlo simulation. So for each run across all the students, the `session_id` will remain static and increment for the next run. Does `session_id` have to be a class parameter then?
Az
@Az: An autogenerated key is useless for filtering, since it's random, unique numbers. (Usually assigned sequentially, but there are no guarantees other than unique.) Session_id, like student_id and every other piece of data except the autogenerated key, simply follows whatever rules you define for the item and the class that maps the item to the database.
S.Lott
@S.Lott: Right. So far, my student_id and allocated_project_id are mapped from the `Student` class. However, my session_id is to be linked with this Monte-Carlo class I'm creating. Is there a way to have this mapped to the table as well, without making it a part of the Student class?Also, in order to use an autogenerated key, do I just leave an `id` column and let it populate itself?
Az
@Az: "session_id is to be linked with this Monte-Carlo class" Vague. "Link" means it links two things. Which two? It looks like it's just an attribute of each Project in the original question. In order to use an autogenerated key, you're going to have to break down and actually read the SQLAlchemy documentation; perhaps ask a distinct question.
S.Lott
@S.Lott: Ah, basically what I meant was: the session_id comes into play for a Monte-Carlo simulation which basically allocates the students, in a randomised order (so everyone has a chance at getting their first choice), for a certain number of sessions (say, 5000.) As such, it would be a parameter for the Monte-Carlo class. Thus, for the first run, it's 1; for the second, 2; so on.Re: autogen keys: Checking up the SQLAlchemy documentation, it seems that I was right in that the 'id' column, if not mapped from a Class, will auto-increment itself as new records are added.
Az
@Az: "a parameter for the Monte-Carlo class" does not sound like anything that belongs in a database. If it belongs in the database, it's an attribute of one thing or a link between two things.
S.Lott
@S.Lott: Many apologies, used the wrong terminology. Basically, the session_id should be an attribute of the Monte-Carlo class since it is directly related to it (for *any* M-C simulation). I could make it an attribute of the Student class to make the mapping very simple. However, if someone were to take away the Student class and replace it, unless he/she had a session_id attribute in his/her new class, it wouldn't be part of the equation anymore. This is why I'd like to keep it in the M-C class but I'm not sure how I can map it to the database. I think that's what I've been meaning to say.
Az
@Az: "if someone were to take away the Student class and replace it," That makes almost no sense at all. If they take away the Student class they totally break the model and create something entirely new. What **are** you talking about?
S.Lott
@S.Lott: Let's say someone decided to replace the Student class with a similar class called Lecturer, and work out an algorithm that did something related to the Lecturer and Rooms. Also, let's assume that he/she makes sure the object mapping for the Lecturer worked. So, maybe some aspects have changed. There's a lecturer instead of a student, and maybe they're been assigned to Rooms at the beginning of an academic year.
Az
`Continued` This is a different allocation algorithm, but my point is, the `session_id` should not care about that nor should it care about other columns. This `id` is something that belongs to a general Monte-Carlo simulation whose role is to run multiple, randomised sessions of a particular algorithm (whatever the algorithm is). This is why I don't want to keep the `session_id` as part of the Student or Lecturer classes. It is something that I need as part of a Monte-Carlo object. However that means my table has to be mapped from both the Student and the Monte-Carlo object. Is that possible?
Az
@Az: "Let's say someone decided to replace the Student class with a similar class called Lecturer"... This is a *different* question. It's only marginally related to the current question, and certainly, not a topic to be covered in the clumsy, ineffective commenting system.
S.Lott
@S.Lott: Okidokey. I'll post this as a separate question. Ta for the advice!
Az
+1  A: 

The logical structure you have can be considered a Ternary Relationship, where recommended to you table is corresponding to the Attendance relationship object. Therefore ideally you should also create you object model similar to this:

Ternary relationship.

Now, in your case one could argue why do you need more then 1 table, if you have only one field for each of the Entity tables. But I would still model it this way, as this model better represents the real world, and you still need to store somewhere the Project students prefer to work on, which would be another table with many-to-many relationship to Student table.

Working with entities is better and easier for you to understand sqlalchemy; whereas if you just keep one table, how much will you delve into the database really?

About composite keys: S.Lott gave you good reasons to avoid using them, and I fully agree with his take on the topic.

van
Thanks for the diagram. I've updated the question to show what the `Student` class structure looks like.
Az