I'm a computer science teacher trying to create a little gradebook for myself using NumPy. But I think it would make my code easier to write if I could create an ndarray that uses field names for both the rows and columns. Here's what I've got so far:

import numpy as np
num_stud = 23
num_assign = 2
grades = np.zeros(num_stud, dtype=[('assign 1','i2'), ('assign 2','i2')]) #etc
gv = grades.view(dtype='i2').reshape(num_stud,num_assign)

So, if my first student gets a 97 on 'assign 1', I can write either of:

grades[0]['assign 1'] = 97
gv[0][0] = 97

Also, I can do the following:

np.mean( grades['assign 1'] ) # class average for assignment 1
np.sum( gv[0] ) # total points for student 1

This all works. But what I can't figure out how to do is use a student id number to refer to a particular student (assume that two of my students have student ids as shown):

grades['123456']['assign 2'] = 95
grades['314159']['assign 2'] = 83

...or maybe create a second view with the different field names?

np.sum( gview2['314159'] ) # total points for the student with the given id

I know that I could create a dict mapping student ids to indices, but that seems fragile and crufty, and I'm hoping there's a better way than:

id2i = { '123456': 0, '314159': 1 }
np.sum( gv[ id2i['314159'] ] )

I'm also willing to re-architect things if there's a cleaner design. I'm new to NumPy, and I haven't written much code yet, so starting over isn't out of the question if I'm Doing It Wrong.

I am going to be needing to sum all the assignment points for over a hundred students once a day, as well as run standard deviations and other stats. Plus, I'll be waiting on the results, so I'd like it to run in only a couple of seconds.

Thanks in advance for any suggestions.

+2  A: 

For entering and storing the data, I would use a relational database such as sqlite, MySQL, or PostgreSQL. If you do it this way, you can easily write multiple programs that analyze the data in different ways. The sqlite database itself can be accessed from a variety of programming languages and GUI/CLI tools. Your data would remain language-agnostic (unlike data stored as numpy arrays).

Python has built-in support for sqlite.

SQL provides a convenient, readable language for slicing and dicing your data (e.g. "What are all the scores for assignment1 from class1? Give a list of the 10 highest scores. Who had those scores? Did class1 have a higher average than class2?"). The database tables would easily accommodate multiple classes and multiple semesters.

For entering data, a GUI may be the most convenient. For sqlite there is sqlitebrowser (I don't have a lot of experience here, though; there might be even better options). For MySQL I like phpmyadmin, and for PostgreSQL, phppgadmin.

Once you have the data entered, you can use a Python module (e.g. sqlite3, MySQLdb, psycopg2) to access the database and issue SQL queries. The data can then be fed into a list or numpy array. You can then use numpy to compute statistics.

PS. For small datasets there is really no issue regarding speed or memory footprint. You do not have to store the data in a numpy array just to call numpy/scipy statistics functions.

You could, for example, draw the data out of the database and into a Python list, and feed the Python list to a numpy function:

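sql = 'SELECT * FROM grades WHERE assignment=%s'  # sqlite3 uses ? in place of %s
args = ['assign1']
cursor.execute(sql, args)   # run the query first
data = cursor.fetchall()    # then fetch the resulting rows
scores = [row[0] for row in data]   # assumes the score is the first column
ave_score = np.mean(scores)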
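
For sqlite specifically, the same idea might look like the following minimal sketch. The in-memory database, the grades table, and its student_id, assignment, and score columns are all assumptions made for illustration. Note that sqlite3 uses ? placeholders where MySQLdb and psycopg2 use %s.

```python
import sqlite3
import numpy as np

# Hypothetical schema: one row per (student, assignment) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grades (student_id TEXT, assignment TEXT, score INTEGER)"
)
rows = [
    ("123456", "assign1", 97),
    ("314159", "assign1", 88),
    ("123456", "assign2", 95),
    ("314159", "assign2", 83),
]
conn.executemany("INSERT INTO grades VALUES (?, ?, ?)", rows)

# Run the query, then fetch the rows it produced.
cursor = conn.execute(
    "SELECT score FROM grades WHERE assignment = ?", ("assign1",)
)
scores = [row[0] for row in cursor.fetchall()]

# A plain Python list can be passed straight to a numpy function.
ave_score = np.mean(scores)
conn.close()
```

Selecting only the score column avoids depending on column order, which SELECT * would require.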

If grades is a numpy structured array, you'll never be able to access values this way:

grades['123456']['assign 2']

since columns are accessed by name, and rows are accessed by non-negative integers.
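
One possible workaround, sketched here under assumptions, is to store the student id as an ordinary field and pick rows with a boolean mask. The id field, its 'U6' width, and the third student id are hypothetical and not part of the original code.

```python
import numpy as np

# Hypothetical layout: the student id is just another field.
dtype = [("id", "U6"), ("assign 1", "i2"), ("assign 2", "i2")]
grades = np.zeros(3, dtype=dtype)
grades["id"] = ["123456", "314159", "271828"]

# Select the column first (a view), then the row with a boolean mask,
# so the assignment writes into the original array.
grades["assign 2"][grades["id"] == "123456"] = 95
grades["assign 2"][grades["id"] == "314159"] = 83
grades["assign 1"][grades["id"] == "314159"] = 90

# Total points for one student: take the matching row, add its score fields.
row = grades[grades["id"] == "314159"][0]
total = sum(int(row[name]) for name in ("assign 1", "assign 2"))
```

The order matters: indexing with the mask first would produce a copy, and an assignment to that copy would be lost.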

I don't think this poses much of an obstacle however. Here's why: Everything you want to do for one student (like find the sum of all assignment points), you'll probably want to do for every student.

So the trick with numpy -- the way to leverage its power -- is to write vectorized equations or use numpy functions that apply to all rows simultaneously, instead of looping over rows individually. Instead of thinking on an individual scale (e.g. individual students, individual assignments), numpy encourages you to think on a grander scale (e.g. all students, all assignments) and to do calculations that apply to all of them at once.

As you've seen from your wrangling with views, you are actually better off not using a structured array and instead opting for a plain 2-axis numpy array.

Let's imagine the columns represent (2) assignments and the rows represent (4) students.

In [36]: grades=np.random.random((4,2))

In [37]: grades
Out[37]: 
array([[ 0.42951657,  0.81696305],
       [ 0.2298493 ,  0.05389136],
       [ 0.12036423,  0.78142328],
       [ 0.5029192 ,  0.75186565]])

Here are some statistics:

In [38]: sum_of_all_assignments = grades.sum(axis=1)

In [39]: sum_of_all_assignments
Out[39]: array([ 1.24647962,  0.28374066,  0.90178752,  1.25478485])

In [40]: average_of_all_assignments = grades.mean(axis=1)

In [41]: average_of_all_assignments
Out[41]: array([ 0.62323981,  0.14187033,  0.45089376,  0.62739242])

In [42]: average_assignment_score = grades.mean(axis=0)

In [43]: average_assignment_score 
Out[43]: array([ 0.32066233,  0.60103583])
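
The same axis argument works with other reductions, which covers the standard deviations mentioned in the question. A small sketch with made-up scores (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical scores: rows are students, columns are assignments.
grades = np.array([[97, 95],
                   [90, 83],
                   [70, 80]], dtype=float)

# axis=0 reduces over students, giving one value per assignment.
std_per_assignment = grades.std(axis=0)
median_per_assignment = np.median(grades, axis=0)

# axis=1 reduces over assignments, giving one value per student.
best_per_student = grades.max(axis=1)
```

For a hundred or so students these calls finish almost instantly, so a daily run is well within a couple of seconds.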

Now suppose these are the names of the students:

In [44]: student_names=['harold','harry','herb','humphrey']

To match student names with their average score, you could create the dict

In [45]: dict(zip(student_names,average_of_all_assignments))
Out[45]: 
{'harold': 0.62323981076528523,
 'harry': 0.14187032892653173,
 'herb': 0.45089375919011698,
 'humphrey': 0.62739242488169067}

And similarly, for assignments:

In [46]: assignment_names=['assign 1','assign 2']

In [47]: dict(zip(assignment_names,average_assignment_score))
Out[47]: {'assign 1': 0.32066232713749887, 'assign 2': 0.60103583474431344}
unutbu
Thanks for the detailed answer, even if the conclusion is what I feared: what I want isn't possible with an ndarray. Unfortunately, my data is going to come in one assignment and one student at a time, and all I've got at first is the name of the assignment (not its number in the sequence of assignments), the student ID number of the student, and the number of points granted for the assignment. Should I edit the question to reflect that I'm more interested in how to easily populate the array?
Graham Mitchell
Thanks again for the additional info, and especially for taking so much time to give suggestions. I think sqlite is overkill, even. Also, the grading part is already handled; I've got a "GUI" in a web page where I grade the assignments and then submit the grades to my "gradebook" server via CGI. So the assignments literally come in as "p01-Mitchell_314159-twentyquestions-java=35" from CGI.
Graham Mitchell
+2  A: 

From your description, you'd be better off using a different data structure than a standard numpy array. ndarrays aren't well suited to this. They're not spreadsheets.

However, there has been extensive recent work on a type of numpy array that is well suited to this use. Here's a description of the recent work on DataArrays. It will be a while before this is fully incorporated into numpy, though...

One of the projects that the upcoming numpy DataArrays are (sort of) based on is "larry" (short for "Labeled Array"). This project sounds like exactly what you want: named rows and columns on something that otherwise acts transparently as a numpy array. It should be stable enough to use (and from my limited playing around with it, it's pretty slick!), but keep in mind that it will probably be replaced by a built-in numpy class eventually.

Nonetheless, you can make good use of the fact that (simple) indexing of a numpy array returns a view into that array, and make a class that provides both interfaces...
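
A minimal sketch of such a class, assuming a plain 2-D array with one dict per axis. The Gradebook name and its methods are hypothetical; they are not part of numpy or larry.

```python
import numpy as np


class Gradebook:
    """Hypothetical wrapper: a plain 2-D array plus row and column labels."""

    def __init__(self, student_ids, assignment_names):
        self.rows = {sid: i for i, sid in enumerate(student_ids)}
        self.cols = {name: j for j, name in enumerate(assignment_names)}
        self.data = np.zeros((len(self.rows), len(self.cols)))

    def student(self, sid):
        # Basic integer indexing returns a view, so writes reach self.data.
        return self.data[self.rows[sid]]

    def assignment(self, name):
        # A column slice is also a view of self.data.
        return self.data[:, self.cols[name]]

    def set(self, sid, name, points):
        self.data[self.rows[sid], self.cols[name]] = points


book = Gradebook(["123456", "314159"], ["assign 1", "assign 2"])
book.set("123456", "assign 2", 95)
book.set("314159", "assign 2", 83)
book.student("314159")[0] = 90  # writes through the view

total = book.student("314159").sum()          # points for one student
average = book.assignment("assign 2").mean()  # class average, one assignment
```

The underlying array stays available as book.data, so whole-class statistics can still be computed with ordinary numpy calls.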

Alternatively, @unutbu's suggestion above is another (simpler and more direct) way of handling it, if you decide to roll your own.

Joe Kington
This might be exactly what I'm looking for! I wish I could upvote you right now, but my reputation is too low! Thanks a bunch.
Graham Mitchell
@Graham, you can accept the answer and then upvote it.
Justin Peel
@Justin - No, I can't vote up anything until I have a reputation score of 15.
Graham Mitchell
@Graham, ok, I thought that you could after accepting the answer for some reason.
Justin Peel
A: 

Sorry, that is impossible.

JSosa23
Could you be more specific? What are you saying is impossible?
Graham Mitchell