ansaurus

Question

Writing csv files with python with exact formatting parameters

Answer 1

+2 A:

If I had a cat for each time I saw a bio or psych or chem database in this state:

"each column contains 50 experiments, each with 4000 rows, for 200000 some rows total. What I want to do is take each column, and make it an individual csv file, with each experiment in its own column. So it would be an array of 50 columns and 4000 rows for each data type"

I'd have way too farking many cats.

I didn't even look at your code because the re-mangling you are proposing is just another problem that will have to be solved. I don't fault you, you claim to be a novice and all your peers make the same sort of error. Beginning programmers who have yet to understand how to use arrays often wind up with variable declarations like:

integer response01, response02, response03, response04, ...

and then very, very redundant code when they try to see if every response is - say - 1. I think this is such a seductive error in bio-informatics because it actually models the paper notations they come from rather well. Unfortunately, the sheet-of-paper model isn't the best way to model data.
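To make the contrast concrete, here is a minimal sketch in Python. The names `responses` and `all_equal_to` and the sample values are invented for illustration; they are not taken from your code.

```python
# Hypothetical sample data: one list replaces response01, response02, ...
responses = [1, 1, 1, 1]


def all_equal_to(values, target):
    """Return True when every element of values equals target."""
    return all(v == target for v in values)


# One call covers every response, however many there are.
print(all_equal_to(responses, 1))
```

The check stays the same size whether there are four responses or four thousand, which is the whole point of putting them in one structure.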

You should read and understand why database normalization was developed, codified and has come to dominate how people think about structured data. One Wikipedia article may not be sufficient. Using the example I excerpted, let me try to explain how I think of it. Your data consists of observations; put another way, the primary datum is a single observation. That observation has a context though: it is one of a set of 4000 observations, where each set belongs to one of 50 experiments. If you had to attach a context to each observation, you'd wind up with an addressing scheme that looks like:

<experiment_number, observation_number, value>

In database jargon, that's a tuple, and it is capable of representing, with no ambiguity and perfect symmetry, the entirety of your data. I'm not certain that I've understood the exact structure of your data, so perhaps it is something more like:

<experiment_number, protocol_number, observation_number, value>

where the protocol may be some form of variable treatment type - let's say pH. But note that I didn't call the protocol a pH and I don't record it as such in the database. What I would then need is an ancillary table showing the relevant parameters of the protocol, e.g.:

<protocol_number, acidity, temperature, pressure>

Now we've just built a "relation" that those database people like to talk about; we've also begun normalizing the data. If you need to know the pH for a given protocol, there is one and only one place to find it: in the proper row of the protocol table. Note that I've divorced the data that fit so nicely together on a data-sheet, and from the observation table I can't see the pH for a particular datum. But that's okay, because I can just look it up in my protocol table if needed. This is a "relational join" and if I needed to, I could coalesce all the various parameters from all the various tables and reconstitute the original datasheet in its original, unstructured glory.
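As a rough sketch of that join, here is one way it might look using Python's standard `sqlite3` module. The table and column names follow the tuples above; the sample values and the helper name `datasheet` are invented for illustration, and your real schema will surely differ.

```python
import sqlite3

# In-memory database; the sample values below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE protocol (
        protocol_number INTEGER PRIMARY KEY,
        acidity REAL, temperature REAL, pressure REAL
    );
    CREATE TABLE observation (
        experiment_number INTEGER,
        protocol_number INTEGER REFERENCES protocol(protocol_number),
        observation_number INTEGER,
        value REAL
    );
""")
conn.executemany("INSERT INTO protocol VALUES (?, ?, ?, ?)",
                 [(1, 6.5, 20.0, 1.0), (2, 7.4, 37.0, 1.0)])
conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 0.12), (1, 1, 2, 0.15), (2, 2, 1, 0.40)])


def datasheet(connection):
    """Join each observation to its protocol, rebuilding the flat data sheet."""
    return connection.execute("""
        SELECT o.experiment_number, o.observation_number, o.value,
               p.acidity, p.temperature, p.pressure
        FROM observation AS o
        JOIN protocol AS p ON p.protocol_number = o.protocol_number
        ORDER BY o.experiment_number, o.observation_number
    """).fetchall()


for row in datasheet(conn):
    print(row)
```

Each protocol's acidity is stored once, in the protocol table, yet every row that comes out of the join carries it, just like the original data-sheet did.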

I hope this answer is of some use to you. I'm certain that I don't even know what field of study your data is from, but these principles apply across domains from drug trials to purchase requisition processing. Please understand that I'm trying to inform, per your request, and there is zero condescension intended. I welcome further questions on the matter.

msw 2010-05-29 04:34:03

Alright, unfortunately, I've no idea how to attach context information to data either. The principles probably do apply, I think I understand what you are getting at with this. For longer term I could see why this would be helpful, but I've always worked with stuff the way I described. I'm curious what the problem is if it is not the code? It does work for what we are doing, even if not in the best manner. The dept is philosophy. Experimental data analysis is very controversial in my dept, generally not an approved method, and therefore not taught. I'll keep an eye on this. Thank you.

Ben Harrison 2010-05-29 05:23:09

+1 for an answer that proves this statement wrong: "This is a question and answer site. Not a 'complex question, insightful answer that grows you as a person site.' People should be able to ask simple questions with simple answers, that just may so happen to be spoon-fed. A lot of people just want to write code that works. Not be empowered." —Owen ( http://meta.stackoverflow.com/questions/8724/how-to-deal-with-google-questions/8728#8728 )

Adam Bernier 2010-05-29 06:05:34

@Adam: thank you for the kind words, I'm not certain whether I should say "oops" for violating meta.* ;) @Ben: thanks for the clarification, I must say that "experimental epistemology" was a new idea for me (and apparently for philosophy). Having done some reading on the subject, it is really applying sociological or anthropological methods to philosophical questions. As a relatively new sub-domain trying to establish itself in the canon, it seems to me that you should use the best analytic practices developed over the last century, not muddle along with the most frequent novice errors.

msw 2010-05-29 13:42:45

I like philosophy, some of my best friends are philosophers. I'd not seen your re-mangling example until just now. "It does work" may be so as you assert, but "works" and "works properly" are two different things. I can think of few things that would irritate philosophers more than being laughed at by philosophers of science for method. By ignoring analytic methods appropriate to the task, you do the emergent field a disservice.

msw 2010-05-29 13:46:43

Answer 2

A:

Normalization of the dataset

Thanks for giving the example. You already have the context I described; perhaps I can make it clearer.

column1             column2            column3
exp1data1time1      exp1data2time1     exp1data3time1
exp1data1time2      exp1data2time2     exp1data3time2

The columns are an artifice made by the last guy; that is, they carry no relevant information. When parsed into a normal form, your data looks just like my first proposed tuple:

<experiment_number, time, response_number, response>

where I suspect time may actually mean "subject_id" or "trial_number". It may very well look incongruous to you to conjoin all the different response values into the same dataset; indeed, based on your desired output, I suspect that it does. At first blush, you might object that "the subject's response to a question about epistemic properties of chairs has no connection to their meta-epistemic beliefs regarding color", but this would be mistaken. The data are related because they have a common experimental subject, and self-correlation is an important concept in sociological analytics.

For example, you may find that respondent A gives the same responses as respondent B, except all of A's responses are biased one higher because of how the subject understood the criteria. This would make a very real difference in the absolute values of the data, but I hope you can see that the question "do A and B actually have different epistemic models?" is salient and valid. One method of data modeling allows this question to be answered easily; your desired method does not.
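To illustrate how the normalized form makes that question easy to ask, here is a small sketch. It assumes rows of the form (subject_id, response_number, response); the sample rows and the function names `responses_by_subject` and `constant_offset` are hypothetical.

```python
from collections import defaultdict

# Hypothetical normalized rows: (subject_id, response_number, response).
rows = [
    ("A", 1, 4), ("A", 2, 6), ("A", 3, 3),
    ("B", 1, 3), ("B", 2, 5), ("B", 3, 2),
]


def responses_by_subject(rows):
    """Group responses as {subject_id: {response_number: response}}."""
    grouped = defaultdict(dict)
    for subject, number, response in rows:
        grouped[subject][number] = response
    return grouped


def constant_offset(rows, first, second):
    """Return the offset if first's answers are second's plus one constant
    on every shared question, otherwise None."""
    grouped = responses_by_subject(rows)
    shared = sorted(set(grouped[first]) & set(grouped[second]))
    diffs = {grouped[first][n] - grouped[second][n] for n in shared}
    return diffs.pop() if len(diffs) == 1 else None


print(constant_offset(rows, "A", "B"))
```

With the responses spread across separate per-column files, the same comparison would need every file opened and lined up by hand first.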

Working parsing code to follow shortly.

msw 2010-05-29 14:17:28

Answer 3

A:

The normalizing code

#!/usr/bin/python

"""parses a csv file containing a particular data layout and normalizes it

    The raw data set is a csv file of the form::

        column1                column2               column3
        exp01data01time01      exp01data02time01     exp01data03time01
        exp01data01time02      exp01data02time02     exp01data03time02

    where there are 40 such columns and the literal column title
    is added as context to the output row

    it is assumed that the columns are comma separated but
    the lexical form of the subcolumns is unspecified.

    Output will consist of a single CSV output stream
    on stdout of the form::

        exp01, data01, time01, column1

    for varying actual values of each field.
"""

import csv
import sys

def split_subfields(s):
    """returns a list of subfields of s
       this function is expected to be re-written to match the actual,
       unspecified lexical structure of s."""
    return [s[0:5], s[5:11], s[11:17]]


def normalise_data(reader, writer):
    """writes one normalized row to writer for each datum read from reader"""

    # obtain the headings for use in normalization
    names = next(reader)

    # get the data rows, split them out by column, add the column name
    for row in reader:
        for column, datum in enumerate(row):
            fields = split_subfields(datum)
            fields.append(names[column])
            writer.writerow(fields)

def main():
    if len(sys.argv) != 2:
        sys.stderr.write('usage: %s input.csv\n' % sys.argv[0])
        sys.exit(1)

    in_file = sys.argv[1]

    reader = csv.reader(open(in_file))
    writer = csv.writer(sys.stdout)
    normalise_data(reader, writer)

if __name__ == '__main__': main()

Such that the command python epistem.py raw_data.csv > cooked_data.csv yields excerpted output looking like:

exp01,data01,time01,column1
...
exp01,data40,time01,column40
exp01,data01,time02,column1
exp01,data01,time03,column1
...
exp02,data40,time15,column40
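If you still need the wide layout you originally asked for, it can be rebuilt from the normalized rows whenever you want it. The following is only a sketch under assumptions I can't verify: that the second field ("data") holds the response value and that the last field ("column") names the data type. The function names `pivot_by_column` and `to_csv_text` and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def pivot_by_column(rows):
    """Group normalized rows into one wide table per source column.

    rows: iterable of [experiment, data, time, column].
    Returns {column: table}; each table is a list of rows, the first being
    a header, with one row per time and one column per experiment.
    """
    cells = defaultdict(dict)      # column -> {(time, experiment): data}
    for experiment, data, time, column in rows:
        cells[column][(time, experiment)] = data

    tables = {}
    for column, values in cells.items():
        # Plain string sort; this relies on zero-padded labels like time01.
        times = sorted({t for t, _ in values})
        experiments = sorted({e for _, e in values})
        table = [["time"] + experiments]
        for t in times:
            # Missing combinations are left as empty cells.
            table.append([t] + [values.get((t, e), "") for e in experiments])
        tables[column] = table
    return tables


def to_csv_text(table):
    """Render one table as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


# Invented sample rows in the normalized layout.
sample = [
    ["exp01", "0.1", "time01", "column1"],
    ["exp01", "0.2", "time02", "column1"],
    ["exp02", "0.3", "time01", "column1"],
    ["exp02", "0.4", "time02", "column1"],
]
print(to_csv_text(pivot_by_column(sample)["column1"]))
```

Writing each table to its own file would give one CSV per data type with each experiment in its own column, while the normalized file stays the single source of truth.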

msw 2010-05-29 18:32:08

Thank you. I will look at the data arrangement. My professor that is heading this thing doesn't care, he'd rather see the resultant graphs etc. There is a second professor (of sociology) who has had some involvement on and off, so I'll run it by him. See how he would have wanted the data if he was analyzing it. The exp is actually about knowledge acquisition, rather than processing/beliefs, so the values are at different times. To see the amount of time to achieve. Finally, the heart of the matter. We are attempting to use actual data to build computer models of knowledge acquisition.

Ben Harrison 2010-05-29 21:24:24

So, this set of data is from the computer model alone, and the epistemic models are specifically accounted for already in it. Alright, thank you, g'day

Ben Harrison 2010-05-29 21:36:58
