I have a folder with 100k text files. I want to put the files that have more than 20 lines in another folder. How do I do this in Python? I used os.listdir, but it runs out of memory just loading the filenames. Is there a way to get, say, 100 filenames at a time?

Here's my code:

import os
import shutil

dir = '/somedir/'

def file_len(fname):
    f = open(fname,'r')
    for i, l in enumerate(f):
        pass
    f.close()
    return i + 1

filenames = os.listdir(dir+'labels/')

i = 0
for filename in filenames:
    flen = file_len(dir+'labels/'+filename)
    print flen
    if flen > 15:
        i = i+1
        shutil.copyfile(dir+'originals/'+filename[:-5], dir+'filteredOrigs/'+filename[:-5])
print i

And the output:

Traceback (most recent call last):
  File "filterimage.py", line 13, in <module>
    filenames = os.listdir(dir+'labels/')
OSError: [Errno 12] Cannot allocate memory: '/somedir/'

Here's the modified script:

import os
import shutil
import glob

topdir = '/somedir'

def filelen(fname, many):
    f = open(fname,'r')
    for i, l in enumerate(f):
        if i > many:
            f.close()
            return True
    f.close()
    return False

path = os.path.join(topdir, 'labels', '*')
i=0
for filename in glob.iglob(path):
    print filename
    if filelen(filename,5):
        i += 1
print i

It works on a folder with fewer files, but with the larger folder all it prints is "0". It works on a Linux server but prints 0 on a Mac. Oh well.

A: 
import os,shutil
os.chdir("/mydir/")
numlines=20
destination = os.path.join("/destination","dir1")
for file in os.listdir("."):
    if os.path.isfile(file):
        flag=0
        for n,line in enumerate(open(file)):
            if n >= numlines:  # n is zero-based, so this is line numlines+1
                flag=1
                break
        if flag:
            try:
                shutil.move(file,destination) 
            except Exception,e: print e
            else:
                print "%s moved to %s" %(file,destination)
ghostdog74
That's the basic task cseric is trying to accomplish, but it's not an answer to his question.
jcdyer
Yes it is. He asked how to put files with over 20 lines into another folder using Python.
ghostdog74
No, he asked how to do it for a directory that has 100,000 files, noting that calling os.listdir("."), as you do, means he runs out of memory.
Lennart Regebro
I do not have a problem loading 100k files using os.listdir.
ghostdog74
That may be, but he specifically said that running os.listdir over everything isn't working for him.
jcdyer
+2  A: 

A couple of thoughts. First, you might use the glob module to get smaller groups of files. Second, filtering by line count is going to be very time-consuming, as you have to open every file and count lines. If you can partition by byte count instead, you can avoid opening the files by checking their sizes with os.stat. If it's crucial that the split happens at 20 lines, you can at least cut out large swaths of files by figuring out the minimum number of characters that a 20-line file of your type will have, and not opening any file smaller than that.
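
The size shortcut might look roughly like the sketch below. The helper names `might_exceed` and `has_more_lines_than` and the `min_bytes` parameter are made up for illustration and are not from the original post. The default threshold is deliberately conservative: a file with more than `limit` lines must contain at least `limit` newline bytes, so anything smaller can be skipped without being opened. If your files have a known minimum line length, a larger `min_bytes` would rule out more of them.

```python
import os

def might_exceed(path, limit=20, min_bytes=None):
    """Cheap pre-check: False means the file is too small to have more than `limit` lines."""
    if min_bytes is None:
        # More than `limit` lines requires at least `limit` newline bytes.
        min_bytes = limit
    return os.stat(path).st_size >= min_bytes

def has_more_lines_than(path, limit=20):
    """Open the file only when the size check cannot rule it out."""
    if not might_exceed(path, limit):
        return False
    with open(path, 'rb') as f:
        for i, _ in enumerate(f):
            if i >= limit:  # i is zero-based, so this is line limit+1
                return True
    return False
```

Reading in binary mode here is a design choice for the sketch: counting newlines does not need decoding, and it avoids encoding errors on odd files.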

jcdyer
+4  A: 

You might try glob.iglob, which returns an iterator:

import glob
import os

topdir = os.path.join('/somedir', 'labels', '*')
for filename in glob.iglob(topdir):
    if filelen(filename) > 15:
        pass  # do stuff

Also, please don't use dir for a variable name: you're shadowing the built-in.
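
If you literally want the names about 100 at a time, as the question asks, one possible sketch is below. It assumes Python 3.6 or later (the code in this thread is Python 2), where os.scandir yields directory entries lazily and can be used in a with-block. The function names `iter_filenames` and `batches` are hypothetical helpers, not standard library calls.

```python
import itertools
import os

def iter_filenames(directory):
    """Yield the names of regular files in `directory` one at a time."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.name

def batches(iterable, size=100):
    """Group any iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Looping over `batches(iter_filenames('/somedir/labels'))` would then hand you one list of up to 100 names per iteration, so only one batch of names is held in a list at any moment.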

Another major improvement that you can introduce is to your filelen function. If you replace it with the following, you'll save a lot of time. Trust me, what you have now is the slowest alternative:

def many_line(fname, many=15):
    for i, line in enumerate(open(fname)):
        if i > many:
            return True
    return False
SilentGhost
Did anybody read the `many_line` function before hitting the upvote button???
John Machin
@John: can anyone here distinguish typo from the real issue?
SilentGhost
@Silent: +1 fattest typo of the year award
John Machin
A: 

How about using a shell script? You could pick one file at a time:

for f in `ls`; do
  if [ $(wc -l < "$f") -gt 20 ]; then
    mv "$f" newfolder
  fi
done

Please correct me if I am wrong in any way.

Aadith
Don't use ls with a for loop like that. It breaks on filenames with spaces. Use shell expansion.
ghostdog74
A: 

The currently accepted answer just plain doesn't work. This function:

def many_line(fname, many=15):
    for i, line in enumerate(line):
        if i > many:
            return True
    return False

has two problems: Firstly, the fname arg is not used and the file is not opened. Secondly, the call to enumerate(line) will fail because line is not defined.

Changing enumerate(line) to enumerate(open(fname)) will fix it.
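
For reference, a sketch with that fix applied. Two further changes go beyond the accepted answer and assume the intent is "more than `many` lines": the file is opened in a with-block so it gets closed, and the comparison is `i >= many`, because enumerate counts from zero and `i > many` would only fire once the file reaches line many + 2.

```python
def many_line(fname, many=15):
    """Return True if the file has more than `many` lines."""
    with open(fname) as f:
        for i, _ in enumerate(f):
            if i >= many:  # zero-based index: this is line many+1
                return True
    return False
```

The early return keeps the main speed benefit of the original idea: long files are abandoned as soon as the threshold is crossed instead of being read to the end.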

John Machin