tags:

views:

104

answers:

2

i have this code:

import csv
import collections

def do_work():
      (data,counter)=get_file('thefile.csv')
      b=samples_subset1(data, counter,'/pythonwork/samples_subset3.csv',500)
      return

def get_file(start_file):

        with open(start_file, 'rb') as f:
            data = list(csv.reader(f))
            counter = collections.defaultdict(int)

            for row in data:
              counter[row[10]] += 1
            return (data,counter)

def samples_subset1(data,counter,output_file,sample_cutoff):

      with open(output_file, 'wb') as outfile:
          writer = csv.writer(outfile)
          b_counter=0
          b=[]
          for row in data:
              if counter[row[10]] >= sample_cutoff:
                 b.append(row) 
                 writer.writerow(row)
                 b_counter+=1
      return (b)

i recently started learning python, and would like to start off with good habits. therefore, i was wondering if you can help me get started to turn this code into classes. i dont know where to start.

+2  A: 

Well, I'm not sure what you want to turn into a class. Do you know what a class is? You want to make a class to represent some type of thing. If I understand your code correctly, you want to filter a CSV to show only those rows whose row[ 10 ] is shared by at least sample_cutoff other rows. Surely you could do that with an Excel filter much more easily than by reading through the file in Python?

What the guy in the other thread suggested is true, but not really applicable to your situation. You used a lot of global variables unnecessarily: if they'd been necessary to the code you should have put everything into a class and made them attributes, but as you didn't need them in the first place, there's no point in making a class.

Some tips on your code:

  • Don't cast the file to a list. That makes Python read the whole thing into memory at once, which is bad if you have a big file. Instead, simply iterate through the file itself: for row in csv.reader(f): Then, when you want to go through the file a second time, just do f.seek(0) to return to the top and start again.

  • Don't put return at the end of every function; that's just unnecessary. You don't need parentheses, either: return spam is fine.

Rewrite

import csv
import collections

def do_work():
    with open( 'thefile.csv' ) as f:
        # Open the file and count the rows.
        data, counter = get_file(f)

        # Go back to the start of the file.
        f.seek(0)

        # Filter to only common rows.
        b = samples_subset1(data, counter, 
            '/pythonwork/samples_subset3.csv', 500)

     return b

def get_file(f):
    counter = collections.defaultdict(int)
    data = csv.reader(f)

    for row in data:
        counter[row[10]] += 1

    return data, counter

def samples_subset1(data, counter, output_file, sample_cutoff):
    with open(output_file, 'wb') as outfile:
        writer = csv.writer(outfile)
        b = []
        for row in data:
            if counter[row[10]] >= sample_cutoff:
                b.append(row) 
                writer.writerow(row)

    return b
katrielalex
thank you. excel is not the right tool for this job. vba is a piece of garbage compared to python :)
I__
Agree 100%. What you've done so far coming from your original code is to create functions that nicely react only to what's given to them as inputs, and have an effect only to what they're going to return as values. This, in itself, is a Good Thing ;) Making this a class reintroduces the danger of manipulating variables in a less obvious fashion than you're doing now, while the complexity of the state isn't high enough that you'd benefit at the same time.
Nicolas78
Are you sure Excel isn't the right tool for the job? I don't mean that you should write a macro in VBA (ugh!!!)! You're doing data filtering: you want to show row n only if `=COUNTIF(K1:K.,Kn)>=sample_cutoff`. That's built into Excel.
katrielalex
im sure it is, as ive had lots of experience with it
I__
Did you actually read what I wrote? I'll tell you: Excel does this, and it does it very simply. Step 1: make a new column in your CSV, with formula `=COUNTIF(K1:K.,Kn)>=sample_cutoff`. Step 2: Data->AutoFilter. Step 3: In the top of the new column, click the drop-down menu button and select TRUE. You're done, no Python involved.
katrielalex
yes you are right if it were just that function then excel would be my choice, but there are a bunch of other steps i need ot do after this
I__
btw my file is 100megs, it would be difficult to do anything with in excel with a file that big without crashing my computer
I__
OK, sorry. Just wanted to make sure.
katrielalex
no i really appreciate your help
I__
please see comments to your suggestions above this questino
I__
+4  A: 

Per my comment on the original post, I don't think a class is necessary here. Still, if other Python programmers will ever read this, I'd suggest getting it inline with PEP8, the Python style guide. Here's a quick rewrite:

import csv
import collections

def do_work():
    data, counter = get_file('thefile.csv')
    b = samples_subset1(data, counter, '/pythonwork/samples_subset3.csv', 500)

def get_file(start_file):
    with open(start_file, 'rb') as f:
        counter = collections.defaultdict(int)
        data = list(csv.reader(f))

        for row in data:
            counter[row[10]] += 1

    return (data, counter)

def samples_subset1(data, counter, output_file, sample_cutoff):
    with open(output_file, 'wb') as outfile:
        writer = csv.writer(outfile)
        b = []
        for row in data:
            if counter[row[10]] >= sample_cutoff:
                b.append(row) 
                writer.writerow(row)

    return b

Notes:

  1. No one uses more than 4 spaces to indent ever. Use 2 - 4. And all your levels of indentation should match.
  2. Use a single space after the commas between arguments to functions ("F(a, b, c)" not "F(a,b,c)")
  3. Naked return statements at the end of a function are meaningless. Functions without return statements implicitly return None
  4. Single space around all operators (a = 1, not a=1)
  5. Do not wrap single values in parentheses. It looks like a tuple, but it isn't.
  6. b_counter wasn't used at all, so I removed it.
  7. csv.reader returns an iterator, which you are casting to a list. That's usually a bad idea because it forces Python to load the entire file into memory at once, whereas the iterator will just return each line as needed. Understanding iterators is absolutely essential to writing efficient Python code. I've left data in for now, but you could rewrite to use an iterator everywhere you're using data, which is a list.
Triptych
what would that look like?
I__
this part does not work for me: for row in csv.reader(f):, it returns nothing
I__
My fault - didn't notice you were returning the `data` variable. That changes things.
Triptych
@l__: to get rid of the list cast, just do `data = csv.reader(f)`. Then, as well as passing `data` to `samples_subset1`, pass `f` and call `f.seek(0)` before iterating through it again. That alone should be a huge performance increase.
katrielalex
katriealex, can u give me an example of how i would pass f.seek(0) ??
I__
File "C:\pythonwork\readthefile072810.py", line 6, in do_work b=samples_subset1(data,counter,'/pythonwork/samples_subset4.csv',500,f) File "C:\pythonwork\readthefile072810.py", line 23, in samples_subset1 f.seek(0)ValueError: I/O operation on closed file
I__
what do i do with f in samples_subset1
I__
Oh, I see. Basically, what you did before was to read the file into a list in memory and then close the file. You don't have to read it all in at once -- but then, you can't close the file. The `with` statement automagically closes f when it terminates, which in this case is bad. I'll add a new post with a rewrite.
katrielalex