views: 132
answers: 6

Hi,

I have a set of 1000 text files with names in_s1.txt, in_s2.txt and so on. Each file contains millions of rows, and each row has 7 columns, like:

ccc245 1 4 5 5 3 -12.3

The most important values for me are those in the first and seventh columns, i.e. pairs like ccc245, -12.3.

What I need to do is to find, across all the in_sXXXX.txt files, the 10 cases with the lowest values in the seventh column, and I also need to know where each value is located, i.e. in which file. I need something like:

FILE  1st_col  7th_col

in_s540.txt ccc3456 -9000.5
in_s520.txt ccc488 -723.4
in_s12.txt ccc34 -123.5
in_s344.txt ccc56 -45.6

I was thinking about using Python and bash for this purpose, but so far I have not found a practical approach. All I know how to do is:

  1. concatenate all the in_ files into IN.TXT
  2. find the lowest values there using: for i in IN.TXT ; do sort -k7n $i | head -n 10; done
  3. given the 1st_col and 7th_col values of the top-ten list, use them to search the in_s files with grep -n VALUE in_s*, so that for each value I get the name of the file it came from

It works, but it is a bit tedious. I wonder whether there is a faster approach using only bash or Python, or both, or some other language better suited to this.

Thanks

+1  A: 

Try something like this in python:

import os

min_values = []

def add_to_min(file_name, one, seven):
    # keep the 10 smallest 7th-column values seen so far
    if len(min_values) < 10:
        min_values.append((seven, file_name, one))
    elif seven < max(min_values)[0]:
        # remove the biggest value...
        min_values.sort()
        min_values.pop()
        # ...and add the new value tuple
        min_values.append((seven, file_name, one))

# loop through all the files
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f:
        columns = line.split()
        add_to_min(file_name, columns[0], float(columns[6]))
    f.close()

# print answers, smallest first
min_values.sort()
for (seven, file_name, one) in min_values:
    print file_name, one, seven

Haven't tested it, but it should get you started.

Version 2, just runs the sort a single time (after a prod by S. Lott):

import os

values = []
# loop through all the files and make a long list of all the rows
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f:
        columns = line.split()
        values.append((float(columns[6]), file_name, columns[0]))
    f.close()

# sort values by the 7th column, print the 10 smallest
values.sort()
for (seven, file_name, one) in values[:10]:
    print file_name, one, seven

Just re-read your question: with millions of rows, you might run out of RAM...

Sam Doshi
Thanks, I am going to check it now.
asdf
That `min_values.sort()` is very bothersome. I think this can be reworked to remove that needless sort. Yes, it's small, but it will be performed millions of times and doesn't seem to help.
S.Lott
Maybe, but we need to remove the biggest value with the pop, so min_values needs to be sorted. You probably could do it more efficiently but sometimes simpler is better than faster, especially if the code is only going to be run a few times. Let me see if I can come up with something better.
Sam Doshi
I think that you don't need to sort to pop the max of 10 minimum values plus one unknown value. I think you can do something like `min_values.remove( max(min_values+[seven]) )` to preserve the smallest value.
S.Lott
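For illustration, a sort-free variant of add_to_min along the lines S.Lott suggests might look roughly like this (an untested sketch, not from either poster; the hard-coded 10 comes from the question):

def add_to_min(file_name, one, seven):
    # keep at most 10 entries; scanning 10 tuples with max() is cheap,
    # so no per-call sort is needed
    if len(min_values) < 10:
        min_values.append((seven, file_name, one))
    else:
        biggest = max(min_values)
        if seven < biggest[0]:
            min_values.remove(biggest)
            min_values.append((seven, file_name, one))

The final 10 tuples would still need one sort before printing.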
+2  A: 

I would:

  • take the first 10 items,
  • sort them, and then
  • for every subsequent line read from the files, insert the element into that top 10:
    • only if its value is lower than the highest one in the current top 10,
    • (keeping the top 10 sorted, for performance)

I wouldn't post the complete program here as it looks like homework.

Yes, if it weren't ten, this would not be optimal.

Antony Hatchkins
A: 

A small improvement of your shell solution:

$ cat in.txt
in_s1.txt
in_s2.txt
...
$ cat in.txt | while read i
do
  cat $i | sed -e "s/^/$i /" # add filename as first column
done |
sort -n -k8 | head -10 | cut -d" " -f1,2,8
With millions of rows in the files, using bash's while/read plus external commands would be slow.
ghostdog74
+1  A: 

If your files are millions of lines long, you might want to consider using "buffering". The script below goes through those millions of lines, each time comparing field 7 with the values in the buffer. If a value is smaller than one in the buffer, that buffer entry is replaced by the new, lower value.

for file in in_*.txt
do
    awk 'NR<=10{
        c=c+1
        val[c]=$7
        tag[c]=$1
    }
    NR>10{
        for(o=1;o<=c;o++){
            if ( $7 <= val[o] ){
                val[o]=$7
                tag[o]=$1
                break
            }
        }
    }
    END{
        for(i=1;i<=c;i++){
            print val[i], tag[i] | "sort"
        }
    }' "$file"
done
ghostdog74
"to find between all the in_sXXXX.txt files, the 10 cases with the lowest values of the seventh column value"That's not what your script does. It prints the smallest value for *every* file.
A: 

This might be close to what you're looking for:

for file in in_s*.txt; do sort -k7n "$file" | head -n 10 | cut -f1,7 -d " " | sed "s/^/$file /" > "${file}.out"; done

cat *.out | sort -k3n | head -n 10 > final_result.out
Dennis Williamson
This gets the ten smallest values from every file and sorts the resulting list of `n*10` values. That's not what was asked for.
Oops, forgot a `head` command. Fixed. Thanks for the downvote.
Dennis Williamson
+3  A: 

In python, use the nsmallest function in the heapq module -- it's designed for exactly this kind of task.

Example (tested) for Python 2.5 and 2.6:

import heapq, glob

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield fname, items[0], float(items[6])
        f.close()

result = heapq.nsmallest(10, my_iterable(), lambda x: x[2])
print result

Update after above answer accepted

Looking at the source code for Python 2.6, it appears that there's a possibility that it does list(iterable) and works on that ... if so, that's not going to work with a thousand files each with millions of lines. If the first answer gives you MemoryError etc, here's an alternative which limits the size of the list to n (n == 10 in your case).

Note: 2.6 only; if you need it for 2.5 use a conditional heapreplace() as explained in the docs. Uses heappush() and heappushpop() which don't have the key arg :-( so we have to fake it.

import glob
from heapq import heappush, heappushpop
from pprint import pprint as pp

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield -float(items[6]), fname, items[0]
        f.close()

def homegrown_nlargest(n, iterable):
    """Ensures heap never has more than n entries"""
    heap = []
    for item in iterable:
        if len(heap) < n:
            heappush(heap, item)
        else:
            heappushpop(heap, item)
    return heap

result =  homegrown_nlargest(10, my_iterable())
result = sorted(result, reverse=True)
result = [(fname, fld0, -negfld6) for negfld6, fname, fld0 in result]
pp(result)
John Machin
Hi, thanks. While trying your solution I get the error:

Traceback (most recent call last):
  File "filter.py", line 13, in <module>
    result = heapq.nsmallest(10, my_iterable(), lambda x: x[2])
  File "/usr/lib64/python2.6/heapq.py", line 359, in nsmallest
    result = _nsmallest(n, it)
  File "filter.py", line 10, in my_iterable
    yield fname, items[0], float(items[5])
IndexError: list index out of range
asdf
That means that you have a line that doesn't have enough fields in it. You are responsible for checking that your data is actually in the format that you believe it's in. By the way, why do you have `items[5]` and not `items[6]` in the yield statement?
John Machin
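For example, one way to guard against such lines (an illustrative tweak, not part of the answer above) is to skip any row that doesn't have at least 7 fields:

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            if len(items) < 7:
                # skip (or log) malformed rows instead of raising IndexError
                continue
            yield fname, items[0], float(items[6])
        f.close()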
Thanks a lot, now it works as expected! I realize that I got some format exceptions on the files that, when checked, give the right results. I will try now to understand your code. Thanks, Jose
asdf