tags:

views:

87

answers:

2

here is a snapshot of my csv:

alex    123f    1
harry   fwef    2
alex    sef 3
alex    gsdf    4
alex    wf35    6
harry   sdfsdf  3

i would like to get the subset of this data where the occurrence of anything in the first column (harry, alex) is at least 4. so i want the resulting data set to be:

alex    123f    1
alex    sef 3
alex    gsdf    4
alex    wf35    6
+1  A: 

if Python is not a must

$ gawk '{b[$1]++;c[++d,$1]=$0}END{for(i in b){if(b[i]>=4){for(j=1;j<=d;j++){print c[j,i]}}}}' file

And yes, 70MB file is fine.

ghostdog74
The first part appears to select all records which end with the digit 4 followed by optional spaces an/or tabs. The OP appears to want records for names which appear at least 4 times. Your solution **accidentally** produces the same output as the OP's required output. Add another record `tom pqrs 4`; your solution prints that, contrary to the OP's requirements.
John Machin
what is this? unix?
I__
Unix and windows (GNU win32)
ghostdog74
@john, yes, i think i misread the requirement.
ghostdog74
+3  A: 

Clearly, you cannot decide which rows are interesting until you've seen all rows (since the very last row might be the one turning some count from three to four and thereby making some previously seen rows interesting, for example;-). So, unless your CSV file is horribly huge, suck it all into memory, first, as a list...:

import csv

with open('thefile.csv', 'rb') as f:
  data = list(csv.reader(f))

then, do the counting -- Python 2.7 has a better way, but assuming you're still on 2.6 like most of us...:

import collections
counter = collections.defaultdict(int)
for row in data:
    counter[row[0]] += 1

and finally do the selection loop...:

for row in data:
    if counter[row[0]] >= 4:
        print row

Of course, this prints each interesting row as a roughly-hewed list (with square brackets and quotes around the items), but it will be easy to format it in any way you might prefer.

Alex Martelli
its 70megs, is that ok?
I__
@I___, yep, it will be just fine. A typical modern machine has no less than a GB (a thousand megs), so it will have ample space to read in and process your seventy-megs file (even though the data typically takes up a bit more space in memory than it does on disk).
Alex Martelli
@I__: try it; it's only 10 lines of code and it won't break your computer if it runs out of memory ... if it does, then you can write some code to implement the outline that I gave you in a comment to your non-question.
John Machin
hi alex, this is not working for me, it's returning the wrong subset
I__
@I__, if your delimiter, as it appears, is a tab (rather than a comma, the default delimiter from which Comma Separated Variable files, aka CSV files, are named), of course you have to make you reader slightly different, e.g. `data = list(csv.reader(f, delimiter='\t'))` rather than the default way I built it. Apart from that obvious issue, the code, applied to your example input file, gives the expected output.
Alex Martelli