views: 132
answers: 6

Hi,

I have a set of 1000 text files with names in_s1.txt, in_s2.txt and so on. Each file contains millions of rows, and each row has 7 columns, like:

ccc245 1 4 5 5 3 -12.3

The most important values for me are those in the first and seventh columns, i.e. pairs like ccc245, -12.3.

What I need to do is to find, across all the in_sXXXX.txt files, the 10 cases with the lowest values in the seventh column, and I also need to know where each value is located, i.e. in which file. I need something like:

FILE  1st_col  7th_col

in_s540.txt ccc3456 -9000.5
in_s520.txt ccc488 -723.4
in_s12.txt ccc34 -123.5
in_s344.txt ccc56 -45.6

I was thinking about using Python and bash for this purpose, but so far I have not found a practical approach. All I know how to do is:

  1. concatenate all the in_ files into IN.TXT
  2. find the lowest values there using: for i in IN.TXT ; do sort -k7n $i | head -n 10; done
  3. given the 1st_col and 7th_col values of the top-ten list, use them to search the in_s files with grep -n VALUE in_s*, so that for each value I get the name of the file it came from

It works, but it is a bit tedious. I wonder whether there is a faster approach using only bash or Python, or both, or some other language better suited to this.

Thanks

+1  A: 

Try something like this in python:

import os

min_values = []

def add_to_min(file_name, one, seven):
    # keep the 10 smallest 7th-column values seen so far
    if len(min_values) < 10:
        min_values.append((seven, file_name, one))
    elif seven < max(min_values)[0]:
        # remove the biggest value...
        min_values.sort()
        min_values.pop()
        # ...and add the new value tuple
        min_values.append((seven, file_name, one))

# loop through all the files
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f:
        columns = line.split()
        add_to_min(file_name, columns[0], float(columns[6]))
    f.close()

# print answers, smallest first
min_values.sort()
for (seven, file_name, one) in min_values:
    print file_name, one, seven

Haven't tested it, but it should get you started.

Version 2, just runs the sort a single time (after a prod by S. Lott):

import os

values = []
# loop through all the files and make a long list of all the rows
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f:
        columns = line.split()
        values.append((float(columns[6]), file_name, columns[0]))
    f.close()

# sort values by the 7th column, print the 10 smallest
values.sort()
for (seven, file_name, one) in values[:10]:
    print file_name, one, seven

Just re-read your question: with millions of rows, you might run out of RAM...

Sam Doshi
Thanks, I am going to check it now.
asdf
That `min_values.sort()` is very bothersome. I think this can be reworked to remove that needless sort. Yes, it's small, but it will be performed millions of times and doesn't seem to help.
S.Lott
Maybe, but we need to remove the biggest value with the pop, so min_values needs to be sorted. You probably could do it more efficiently but sometimes simpler is better than faster, especially if the code is only going to be run a few times. Let me see if I can come up with something better.
Sam Doshi
I think that you don't need to sort to pop the max of 10 minimum values plus one unknown value. I think you can do something like `min_values.remove( max(min_values+[seven]) )` to preserve the smallest value.
S.Lott
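For illustration, a sort-free variant of add_to_min along the lines S.Lott suggests might look roughly like this (an untested sketch, not from either poster; the hard-coded 10 comes from the question):

def add_to_min(file_name, one, seven):
    # keep at most 10 entries; scanning 10 tuples with max() is cheap,
    # so no per-call sort is needed
    if len(min_values) < 10:
        min_values.append((seven, file_name, one))
    else:
        biggest = max(min_values)
        if seven < biggest[0]:
            min_values.remove(biggest)
            min_values.append((seven, file_name, one))

The final 10 tuples would still need one sort before printing.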
+2  A: 

I would:

  • take the first 10 items,
  • sort them, and then
  • for every subsequent line read from the files, insert the element into that top 10:
    • only if its value is lower than the highest one in the current top 10,
    • (keeping the top 10 sorted, for performance)

I wouldn't post the complete program here as it looks like homework.

Yes, if it weren't ten, this would not be optimal.

Antony Hatchkins
A: 

A small improvement of your shell solution:

$ cat in.txt
in_s1.txt
in_s2.txt
...
$ cat in.txt | while read i
do
  cat $i | sed -e "s/^/$i /" # add filename as first column
done |
sort -n -k8 | head -10 | cut -d" " -f1,2,8
With millions of rows in the files, using bash's while/read plus external commands would be slow.
ghostdog74
+1  A: 

If your files are millions of lines long, you might want to consider using "buffering". The script below goes through those millions of lines, each time comparing field 7 with the values in the buffer. If a value is smaller than one in the buffer, that buffer entry is replaced by the new, lower value.

for file in in_*.txt
do
    awk 'NR<=10{
        c=c+1
        val[c]=$7
        tag[c]=$1
    }
    NR>10{
        for(o=1;o<=c;o++){
            if ( $7 <= val[o] ){
                val[o]=$7
                tag[o]=$1
                break
            }
        }
    }
    END{
        for(i=1;i<=c;i++){
            print val[i], tag[i] | "sort"
        }
    }' "$file"
done
ghostdog74
"to find between all the in_sXXXX.txt files, the 10 cases with the lowest values of the seventh column value"That's not what your script does. It prints the smallest value for *every* file.
A: 

This might be close to what you're looking for:

for file in in_s*.txt; do sort -k7n "$file" | head -n 10 | cut -f1,7 -d " " | sed "s/^/$file /" > "${file}.out"; done

cat *.out | sort -k3n | head -n 10 > final_result.out
Dennis Williamson
This gets the ten smallest values from every file and sorts the resulting list of `n*10` values. That's not what was asked for.
Oops, forgot a `head` command. Fixed. Thanks for the downvote.
Dennis Williamson
+3  A: 

In python, use the nsmallest function in the heapq module -- it's designed for exactly this kind of task.

Example (tested) for Python 2.5 and 2.6:

import heapq, glob

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield fname, items[0], float(items[6])
        f.close()

result = heapq.nsmallest(10, my_iterable(), lambda x: x[2])
print result

Update after above answer accepted

Looking at the source code for Python 2.6, it appears that there's a possibility that it does list(iterable) and works on that ... if so, that's not going to work with a thousand files each with millions of lines. If the first answer gives you MemoryError etc, here's an alternative which limits the size of the list to n (n == 10 in your case).

Note: 2.6 only; if you need it for 2.5 use a conditional heapreplace() as explained in the docs. Uses heappush() and heappushpop() which don't have the key arg :-( so we have to fake it.

import glob
from heapq import heappush, heappushpop
from pprint import pprint as pp

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield -float(items[6]), fname, items[0]
        f.close()

def homegrown_nlargest(n, iterable):
    """Ensures heap never has more than n entries"""
    heap = []
    for item in iterable:
        if len(heap) < n:
            heappush(heap, item)
        else:
            heappushpop(heap, item)
    return heap

result =  homegrown_nlargest(10, my_iterable())
result = sorted(result, reverse=True)
result = [(fname, fld0, -negfld6) for negfld6, fname, fld0 in result]
pp(result)
John Machin
Hi, thanks. While trying your solution I get the error:

Traceback (most recent call last):
  File "filter.py", line 13, in <module>
    result = heapq.nsmallest(10, my_iterable(), lambda x: x[2])
  File "/usr/lib64/python2.6/heapq.py", line 359, in nsmallest
    result = _nsmallest(n, it)
  File "filter.py", line 10, in my_iterable
    yield fname, items[0], float(items[5])
IndexError: list index out of range
asdf
That means that you have a line that doesn't have enough fields in it. You are responsible for checking that your data is actually in the format that you believe it's in. By the way, why do you have `items[5]` and not `items[6]` in the yield statement?
John Machin
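For example, one way to guard against such lines (an illustrative tweak, not part of the answer above) is to skip any row that doesn't have at least 7 fields:

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            if len(items) < 7:
                # skip (or log) malformed rows instead of raising IndexError
                continue
            yield fname, items[0], float(items[6])
        f.close()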
Thanks a lot, now it works as expected! I realize that I got some format exceptions on the files that, when checked, give the right results. I will try now to understand your code. Thanks, Jose
asdf