Hi, I have a ton of data spread over multiple CSV files, and I filter out a data set using grep:
user@machine:~/$ cat data.csv | grep -a "63[789]\...;"
637.05;1450.2
637.32;1448.7
637.60;1447.7
637.87;1451.5
638.14;1454.2
638.41;1448.6
638.69;1445.8
638.96;1440.0
639.23;1431.9
639.50;1428.8
639.77;1427.3
I want to find the data set with the highest count (the column right of the ;) and then know the corresponding value (left of the ;). In this case the set I'm looking for would be 638.14;1454.2.
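To make the goal concrete, here is a tiny self-contained sketch of the "keep the row with the largest second field" step using awk (the file name data_sample.csv and its contents are made up for illustration):

```shell
# Create a small sample file like the grep output above (made-up name/values)
cat > data_sample.csv <<'EOF'
637.05;1450.2
638.14;1454.2
639.77;1427.3
EOF

# Track the line whose second ;-separated field is largest; print it at the end
best=$(awk -F';' '$2 > max { max = $2; line = $0 } END { print line }' data_sample.csv)
echo "$best"
```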
I tried different things and ended up using a combination of bash and Python, which works but isn't very pretty:
import os, csv

os.system('ls | grep csv > filelist')
files = open("filelist")
files = files.read()
files = files.split("\n")
for filename in files[0:-1]:
    os.system('cat ' + filename + ' | grep -a "63[6789]\...;" > filtered.csv')
    filtered = csv.reader(open('filtered.csv'), delimiter=';')
    # sort numerically on the second column, highest count first
    sortedlist = sorted(filtered, key=lambda row: float(row[1]), reverse=True)
    dataset = sortedlist[0][0] + ';' + sortedlist[0][1] + '\n'
I would love to have a bash-only solution (cut, awk, arrays?!?) but couldn't figure it out. I also don't like the workaround of writing the bash output to files and then reading it back into Python variables. Can I read it into variables directly, or is there a better way to approach this problem? (Probably Perl etc., but I'm really interested in a bash solution.)
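Something along these lines is what I'm imagining, as an untested sketch: loop over the CSV files directly and capture each result in a shell variable via command substitution, with no intermediate files (the file name part1.csv and its contents are made up here):

```shell
# Made-up sample input file standing in for one of the real CSVs
cat > part1.csv <<'EOF'
637.05;1450.2
638.14;1454.2
638.96;1440.0
EOF

for f in part*.csv; do
    # sort -t';' -k2,2g sorts numerically on the second ;-separated field;
    # tail -n1 then keeps the row with the highest count
    best=$(grep -a '63[789]\...;' "$f" | sort -t';' -k2,2g | tail -n1)
    echo "$f: $best"
done
```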
Thank you very much!!