ansaurus

Question

ndarray field names for both row and column?

Answer 1

+2 A:

For entering and storing the data, I would use a relational database (such as sqlite, MySQL or Postgresql). If you do it this way, you can easily write multiple programs which analyze the data in different ways. The sqlite database itself can be accessed from a variety of programming languages and GUI/CLI interfaces. Your data would remain language agnostic (unlike storing numpy arrays).

Python has built-in support for sqlite.

SQL provides a convenient, readable language for slicing and dicing your data (e.g. "What are all the scores for assignment1 from class1? Give a list of the 10 highest scores. Who had those scores? Did class1 have a higher average than class2?"). The database tables would easily accommodate multiple classes and multiple semesters.
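As a rough sketch of what such queries could look like with Python's built-in sqlite3 module: the table layout, column names, student names and scores below are all made up for illustration, and an in-memory database stands in for a real file.

```python
import sqlite3

# Hypothetical schema: one row per (class, student, assignment) score.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grades "
    "(class_name TEXT, student TEXT, assignment TEXT, score REAL)"
)
rows = [
    ("class1", "harold", "assignment1", 35.0),
    ("class1", "harry", "assignment1", 28.0),
    ("class1", "herb", "assignment1", 40.0),
    ("class2", "humphrey", "assignment1", 31.0),
    ("class2", "hazel", "assignment1", 25.0),
]
conn.executemany("INSERT INTO grades VALUES (?, ?, ?, ?)", rows)

# Up to 10 highest scores for assignment1 in class1, with who earned them.
top_scores = conn.execute(
    "SELECT student, score FROM grades "
    "WHERE class_name = ? AND assignment = ? "
    "ORDER BY score DESC LIMIT 10",
    ("class1", "assignment1"),
).fetchall()

# Average score per class, to compare class1 against class2.
class_averages = dict(
    conn.execute(
        "SELECT class_name, AVG(score) FROM grades "
        "WHERE assignment = ? GROUP BY class_name",
        ("assignment1",),
    ).fetchall()
)
conn.close()
```

Here top_scores is a list of (student, score) tuples and class_averages maps each class name to its mean score.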

For entering data, a GUI may be the most convenient. For sqlite there is sqlitebrowser (I don't have a lot of experience here, though; there might be even better options). For MySQL I like phpmyadmin, and for Postgresql, phppgadmin.

Once you have the data entered, you can use a Python module (e.g. sqlite3, MySQLdb, psycopg2) to access the database and issue SQL queries. The data can then be fed into a list or numpy array. You can then use numpy to compute statistics.

PS. For small datasets there is really no issue regarding speed or memory footprint. You do not have to store the data in a numpy array just to call numpy/scipy statistics functions.

You could, for example, draw the data out of the database and into a Python list, and feed the Python list to a numpy function:

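import numpy as np
# cursor: an open DB-API cursor (MySQLdb and psycopg2 use %s placeholders; sqlite3 uses ?)
sql = 'SELECT * FROM grades WHERE assignment = %s'
args = ['assign1']
cursor.execute(sql, args)            # run the query first
data = cursor.fetchall()             # fetchall() takes no arguments
scores = [row[0] for row in data]    # first column of every row
ave_score = np.mean(scores)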

If grades is a numpy structured array, you'll never be able to access values this way:

grades['123456']['assign 2']

since columns are accessed by name, and rows are accessed by non-negative integers.
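To make that concrete, here is a small sketch with a made-up two-student structured array (the field names and scores are only for illustration). Fields are reached by name and records by integer position, while a string that is not a field name is rejected rather than being treated as a row label.

```python
import numpy as np

# Hypothetical structured array: one record per student, one field per assignment.
grades = np.array(
    [(35.0, 40.0), (28.0, 31.0)],
    dtype=[("assign 1", "f8"), ("assign 2", "f8")],
)

column = grades["assign 2"]      # field (column) access by name
record = grades[0]               # row access by integer position
cell = grades[0]["assign 2"]     # integer row first, then field name

# A string that is not a field name raises an error; it is never a row label.
try:
    grades["123456"]
    row_label_worked = True
except (ValueError, KeyError):
    row_label_worked = False
```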

I don't think this poses much of an obstacle however. Here's why: Everything you want to do for one student (like find the sum of all assignment points), you'll probably want to do for every student.

So the trick with numpy -- the way to leverage its power -- is to write vectorized equations or use numpy functions that apply to all rows simultaneously, instead of looping over rows individually. Instead of thinking on an individual scale (e.g. individual students, individual assignments), numpy encourages you to think on a grander scale (e.g. all students, all assignments) and to do calculations that apply to all of them at once.
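A small sketch of the difference, using random scores as stand-in data (the variable names are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
grades = rng.random((4, 2))  # 4 students (rows) x 2 assignments (columns)

# Row by row: one Python-level iteration per student.
totals_loop = np.array([sum(row) for row in grades])

# Vectorized: a single call that covers every student at once.
totals_vectorized = grades.sum(axis=1)

# A vectorized expression: each score as a fraction of that assignment's best score.
relative = grades / grades.max(axis=0)
```

Both approaches give the same totals; the vectorized form simply states the calculation once for all rows.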

As you've seen with your wrangling with views, you are actually better off not using a structured array, instead opting for a plain 2-axis numpy array:

Let's imagine the columns represent (2) assignments and the rows represent (4) students.

In [36]: grades=np.random.random((4,2))

In [37]: grades
Out[37]: 
array([[ 0.42951657,  0.81696305],
       [ 0.2298493 ,  0.05389136],
       [ 0.12036423,  0.78142328],
       [ 0.5029192 ,  0.75186565]])

Here are some statistics:

In [38]: sum_of_all_assignments = grades.sum(axis=1)

In [39]: sum_of_all_assignments
Out[39]: array([ 1.24647962,  0.28374066,  0.90178752,  1.25478485])

In [40]: average_of_all_assignments = grades.mean(axis=1)

In [41]: average_of_all_assignments
Out[41]: array([ 0.62323981,  0.14187033,  0.45089376,  0.62739242])

In [42]: average_assignment_score = grades.mean(axis=0)

In [43]: average_assignment_score 
Out[43]: array([ 0.32066233,  0.60103583])

Now suppose these are the names of the students:

In [44]: student_names=['harold','harry','herb','humphrey']

To match student names with their average score, you could create the dict

In [45]: dict(zip(student_names,average_of_all_assignments))
Out[45]: 
{'harold': 0.62323981076528523,
 'harry': 0.14187032892653173,
 'herb': 0.45089375919011698,
 'humphrey': 0.62739242488169067}

And similarly, for assignments:

In [46]: assignment_names=['assign 1','assign 2']

In [47]: dict(zip(assignment_names,average_assignment_score))
Out[47]: {'assign 1': 0.32066232713749887, 'assign 2': 0.60103583474431344}
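If you do need to read or write one particular cell by label, a plain pair of dicts mapping labels to integer positions is one possible approach. This is only a sketch; the student IDs and the names row_of and col_of are invented for illustration.

```python
import numpy as np

student_ids = ["123456", "234567", "345678", "456789"]   # hypothetical IDs
assignment_names = ["assign 1", "assign 2"]
grades = np.zeros((len(student_ids), len(assignment_names)))

# Map each label to its integer position along the matching axis.
row_of = {sid: i for i, sid in enumerate(student_ids)}
col_of = {name: j for j, name in enumerate(assignment_names)}

# Record one grade at a time, then read it back by label.
grades[row_of["123456"], col_of["assign 2"]] = 35
score = grades[row_of["123456"], col_of["assign 2"]]
```

The array itself stays a plain 2-axis array, so grades.sum(axis=1) and friends keep working unchanged.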

unutbu 2010-10-12 00:32:21

Thanks for the detailed answer, even if the conclusion is what I feared: what I want isn't possible with an ndarray. Unfortunately, my data is going to come in one assignment and one student at a time, and all I've got at first is the name of the assignment (not its number in the sequence of assignments), the student ID number of the student, and the number of points granted for the assignment. Should I edit the question to reflect that I'm more interested in how to easily populate the array?

Graham Mitchell 2010-10-12 02:34:55

Thanks again for the additional info, and especially for taking so much time to give suggestions. I think sqlite is overkill, even. Also, the grading part is already handled; I've got a "GUI" in a web page where I grade the assignments and then submit the grades to my "gradebook" server via CGI. So the assignments literally come in as "p01-Mitchell_314159-twentyquestions-java=35" from CGI.

Graham Mitchell 2010-10-12 13:46:58

Answer 2

+2 A:

From your description, you'd be better off using a different data structure than a standard numpy array. ndarrays aren't well suited to this... They're not spreadsheets.

However, there has been extensive recent work on a type of numpy array that is well suited to this use. Here's a description of the recent work on DataArrays. It will be a while before this is fully incorporated into numpy, though...

One of the projects that the upcoming numpy DataArrays are (sort of) based on is "larry" (short for "Labeled Array"). This project sounds like exactly what you want... (It has named rows and columns but otherwise acts transparently as a numpy array.) It should be stable enough to use (and from my limited playing around with it, it's pretty slick!), but keep in mind that it will probably be replaced by a built-in numpy class eventually.

Nonetheless, you can make good use of the fact that (simple) indexing of a numpy array returns a view into that array, and make a class that provides both interfaces...
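A minimal sketch of what such a class might look like. The class name, method names, and sample IDs are all hypothetical, and this is not how larry or DataArray work internally; it only illustrates writing through views.

```python
import numpy as np


class LabeledGrades:
    """Hypothetical wrapper: label lookups on top of a plain 2-D array."""

    def __init__(self, row_labels, col_labels):
        self.rows = {label: i for i, label in enumerate(row_labels)}
        self.cols = {label: j for j, label in enumerate(col_labels)}
        self.data = np.zeros((len(self.rows), len(self.cols)))

    def row(self, label):
        # Basic integer indexing returns a view, not a copy.
        return self.data[self.rows[label]]

    def column(self, label):
        # Basic slicing with an integer column index is also a view.
        return self.data[:, self.cols[label]]


book = LabeledGrades(["123456", "234567"], ["assign 1", "assign 2"])
book.row("123456")[book.cols["assign 2"]] = 35   # write through the row view
book.column("assign 1")[:] = 10                  # write through the column view
```

Because row() and column() return views, assignments made through them land in book.data, which can still be passed straight to numpy functions.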

Alternatively, @unutbu's suggestion above is another (simpler and more direct) way of handling it, if you decide to roll your own.

Joe Kington 2010-10-12 00:41:38

This might be exactly what I'm looking for! I wish I could upvote you right now, but my reputation is too low! Thanks a bunch.

Graham Mitchell 2010-10-12 02:39:30

@Graham, you can accept the answer and then upvote it.

Justin Peel 2010-10-12 06:11:35

@Justin - No, I can't vote up anything until I have a reputation score of 15.

Graham Mitchell 2010-10-12 13:22:29

@Graham, ok, I thought that you could after accepting the answer for some reason.

Justin Peel 2010-10-12 15:28:29

Answer 3

A:

Sorry, that is impossible.

JSosa23 2010-10-12 15:52:13

Could you be more specific? What are you saying is impossible?

Graham Mitchell 2010-10-12 16:52:24
