ansaurus

Question

python newbie question: converting code to classes

Answer 1

+2 A:

Well, I'm not sure what you want to turn into a class. Do you know what a class is? You want to make a class to represent some type of thing. If I understand your code correctly, you want to filter a CSV to show only those rows whose row[ 10 ] is shared by at least sample_cutoff other rows. Surely you could do that with an Excel filter much more easily than by reading through the file in Python?

What the guy in the other thread suggested is true, but not really applicable to your situation. You used a lot of global variables unnecessarily: if they'd been necessary to the code you should have put everything into a class and made them attributes, but as you didn't need them in the first place, there's no point in making a class.

Some tips on your code:

Don't cast the file to a list. That makes Python read the whole thing into memory at once, which is bad if you have a big file. Instead, simply iterate through the file itself: for row in csv.reader(f): Then, when you want to go through the file a second time, just do f.seek(0) to return to the top and start again.
Don't put return at the end of every function; that's just unnecessary. You don't need parentheses, either: return spam is fine.

Rewrite

import csv
import collections

def do_work():
    with open( 'thefile.csv' ) as f:
        # Open the file and count the rows.
        data, counter = get_file(f)

        # Go back to the start of the file.
        f.seek(0)

        # Filter to only common rows.
        b = samples_subset1(data, counter, 
            '/pythonwork/samples_subset3.csv', 500)

     return b

def get_file(f):
    counter = collections.defaultdict(int)
    data = csv.reader(f)

    for row in data:
        counter[row[10]] += 1

    return data, counter

def samples_subset1(data, counter, output_file, sample_cutoff):
    with open(output_file, 'wb') as outfile:
        writer = csv.writer(outfile)
        b = []
        for row in data:
            if counter[row[10]] >= sample_cutoff:
                b.append(row) 
                writer.writerow(row)

    return b

katrielalex 2010-07-28 15:49:08

thank you. excel is not the right tool for this job. vba is a piece of garbage compared to python :)

I__ 2010-07-28 15:51:00

Agree 100%. What you've done so far coming from your original code is to create functions that nicely react only to what's given to them as inputs, and have an effect only to what they're going to return as values. This, in itself, is a Good Thing ;) Making this a class reintroduces the danger of manipulating variables in a less obvious fashion than you're doing now, while the complexity of the state isn't high enough that you'd benefit at the same time.

Nicolas78 2010-07-28 15:51:17

Are you sure Excel isn't the right tool for the job? I don't mean that you should write a macro in VBA (ugh!!!)! You're doing data filtering: you want to show row n only if `=COUNTIF(K1:K.,Kn)>=sample_cutoff`. That's built into Excel.

katrielalex 2010-07-28 16:05:44

im sure it is, as ive had lots of experience with it

I__ 2010-07-28 16:07:51

Did you actually read what I wrote? I'll tell you: Excel does this, and it does it very simply. Step 1: make a new column in your CSV, with formula `=COUNTIF(K1:K.,Kn)>=sample_cutoff`. Step 2: Data->AutoFilter. Step 3: In the top of the new column, click the drop-down menu button and select TRUE. You're done, no Python involved.

katrielalex 2010-07-28 16:12:27

yes you are right if it were just that function then excel would be my choice, but there are a bunch of other steps i need ot do after this

I__ 2010-07-28 16:13:26

btw my file is 100megs, it would be difficult to do anything with in excel with a file that big without crashing my computer

I__ 2010-07-28 16:14:51

OK, sorry. Just wanted to make sure.

katrielalex 2010-07-28 16:40:38

no i really appreciate your help

I__ 2010-07-28 16:52:51

please see comments to your suggestions above this questino

I__ 2010-07-28 16:53:17

Answer 2

+4 A:

Per my comment on the original post, I don't think a class is necessary here. Still, if other Python programmers will ever read this, I'd suggest getting it inline with PEP8, the Python style guide. Here's a quick rewrite:

import csv
import collections

def do_work():
    data, counter = get_file('thefile.csv')
    b = samples_subset1(data, counter, '/pythonwork/samples_subset3.csv', 500)

def get_file(start_file):
    with open(start_file, 'rb') as f:
        counter = collections.defaultdict(int)
        data = list(csv.reader(f))

        for row in data:
            counter[row[10]] += 1

    return (data, counter)

def samples_subset1(data, counter, output_file, sample_cutoff):
    with open(output_file, 'wb') as outfile:
        writer = csv.writer(outfile)
        b = []
        for row in data:
            if counter[row[10]] >= sample_cutoff:
                b.append(row) 
                writer.writerow(row)

    return b

Notes:

No one uses more than 4 spaces to indent ever. Use 2 - 4. And all your levels of indentation should match.
Use a single space after the commas between arguments to functions ("F(a, b, c)" not "F(a,b,c)")
Naked return statements at the end of a function are meaningless. Functions without return statements implicitly return None
Single space around all operators (a = 1, not a=1)
Do not wrap single values in parentheses. It looks like a tuple, but it isn't.
b_counter wasn't used at all, so I removed it.
csv.reader returns an iterator, which you are casting to a list. That's usually a bad idea because it forces Python to load the entire file into memory at once, whereas the iterator will just return each line as needed. Understanding iterators is absolutely essential to writing efficient Python code. I've left data in for now, but you could rewrite to use an iterator everywhere you're using data, which is a list.

Triptych 2010-07-28 15:56:54

what would that look like?

I__ 2010-07-28 16:04:04

this part does not work for me: for row in csv.reader(f):, it returns nothing

I__ 2010-07-28 16:18:26

My fault - didn't notice you were returning the `data` variable. That changes things.

Triptych 2010-07-28 16:26:24

@l__: to get rid of the list cast, just do `data = csv.reader(f)`. Then, as well as passing `data` to `samples_subset1`, pass `f` and call `f.seek(0)` before iterating through it again. That alone should be a huge performance increase.

katrielalex 2010-07-28 16:43:45

katriealex, can u give me an example of how i would pass f.seek(0) ??

I__ 2010-07-28 16:46:21

File "C:\pythonwork\readthefile072810.py", line 6, in do_work b=samples_subset1(data,counter,'/pythonwork/samples_subset4.csv',500,f) File "C:\pythonwork\readthefile072810.py", line 23, in samples_subset1 f.seek(0)ValueError: I/O operation on closed file

I__ 2010-07-28 16:49:05

what do i do with f in samples_subset1

I__ 2010-07-28 16:52:35

Oh, I see. Basically, what you did before was to read the file into a list in memory and then close the file. You don't have to read it all in at once -- but then, you can't close the file. The `with` statement automagically closes f when it terminates, which in this case is bad. I'll add a new post with a rewrite.

katrielalex 2010-07-28 18:06:27

ansaurus

tags:

views:

answers:

python newbie question: converting code to classes

Rewrite

related questions