I have a huge text file (~1GB) and sadly the text editor I use won't read such a large file. However, if I can just split it into two or three parts I'll be fine, so, as an exercise, I wanted to write a program in Python to do it.

What I think I want the program to do is find the size of the file, divide that number into parts, and for each part: read up to that point in chunks, writing to a filename.nnn output file, then read up to the next line break and write that, then close the output file, and so on. Obviously the last output file just copies to the end of the input file.

Can you help me with the key filesystem-related parts: file size, reading and writing in chunks, and reading to a line break?

I'll be writing this code test-first, so there's no need to give me a complete answer, unless it's a one-liner ;-)

+8  A: 

Check out os.stat() for file size and file.readlines([sizehint]). Those two functions should be all you need for the reading part, and hopefully you know how to do the writing :)
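
A minimal sketch of how those two calls fit together, run against a small throwaway file (the file name and the half-way split point are only for illustration). The size hint passed to readlines() is approximate: it stops after the line that takes the running total past the hint, so lines are never cut in half.

```python
import os
import tempfile

# Build a small throwaway file so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.writelines("line %d\n" % i for i in range(100))

size = os.stat(path).st_size  # os.path.getsize(path) returns the same number

with open(path) as f:
    first_half = f.readlines(size // 2)  # whole lines, roughly size // 2 characters
    second_half = f.readlines()          # no hint: everything that is left
```

Writing each list back out with writelines() would then give two files that both end on a line boundary.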

Kamil Kisiel
Thanks for the answer - your suggestions are working well so far for reading the file. When I've finished, I'll also try a binary version that doesn't read one line at a time.
quamrana
What is wrong with `os.path.getsize(filename)`?
J.F. Sebastian
+1  A: 

You can use wc and split (see the respective manpages) to get the desired effect. In bash:

split -dl$((`wc -l 'filename'|sed 's/ .*$//'` / 3 + 1)) filename filename-chunk.

produces 3 parts of the same linecount (with a rounding error in the last, of course), named filename-chunk.00 to filename-chunk.02.

Svante
Yes, it is not Python, but why use a screwdriver to apply a nail?
Svante
Well it's not really a screwdriver vs. nail... python often is a great way to accomplish simple tasks such as this. And I don't want to bash bash (pun intended) but that is not really... readable :)
Agos
It is very readable, you just need to know the language.
Svante
A: 

Or, a Python version of wc and split:

lines = 0
for l in open(filename): lines += 1

Then some code to read the first lines/3 lines into one file, the next lines/3 into another, etc.
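
One way that second step might look, as a sketch rather than a finished tool. The function name split_by_lines and the output naming (filename.1, filename.2, ...) are made up for illustration, and it assumes the file decodes with the platform's default text encoding. It makes two passes: one to count lines, one to copy them.

```python
import itertools

def split_by_lines(filename, num_parts):
    """Split filename into up to num_parts files of roughly equal line count.

    Returns the list of output file names (hypothetical naming scheme).
    """
    # First pass: count lines without holding the file in memory.
    with open(filename) as src:
        total = sum(1 for _ in src)
    per_part = max(1, -(-total // num_parts))  # ceiling division

    written = []
    # Second pass: stream per_part lines at a time into successive files.
    with open(filename) as src:
        for n in range(1, num_parts + 1):
            lines = itertools.islice(src, per_part)
            first = next(lines, None)
            if first is None:  # input exhausted, no empty output files
                break
            name = "%s.%d" % (filename, n)
            with open(name, "w") as dst:
                dst.write(first)
                dst.writelines(lines)
            written.append(name)
    return written
```

Because the lines are streamed through islice, memory use stays small even when each part is several hundred megabytes.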

Claudiu
No need to keep the count manually, use enumerate: for l, line in enumerate(open(filename)): ...
Matthew Trevor
A: 

I've written the program and it seems to work fine. So thanks to Kamil Kisiel for getting me started.
(Note that FileSizeParts() is a function not shown here)
Later I may get round to doing a version that does a binary read to see if it's any quicker.

import os

def Split(inputFile, numParts, outputName):
    # Returns the number of output files written.
    fileSize = os.stat(inputFile).st_size
    # FileSizeParts() (not shown) supplies the size hint for each part
    parts = FileSizeParts(fileSize, numParts)
    outPart = 0
    with open(inputFile, 'r') as openInputFile:
        for part in parts:
            if openInputFile.tell() < fileSize:
                outPart += 1
                fullOutputName = outputName + os.extsep + str(outPart)
                with open(fullOutputName, 'w') as openOutputFile:
                    # readlines(part) returns whole lines totalling roughly `part` bytes
                    openOutputFile.writelines(openInputFile.readlines(part))
    return outPart
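
For comparison, here is a rough sketch of what the binary version mentioned above could look like. It is not the code from this answer: the name split_binary, the bufsize parameter, and the equal-sized parts are all assumptions made for illustration. It copies each part in fixed-size blocks and then reads on to the next line break, as described in the question.

```python
import os

def split_binary(input_file, num_parts, output_name, bufsize=1024 * 1024):
    """Split input_file into at most num_parts files, cutting only after a newline.

    Returns the number of output files written.
    """
    file_size = os.path.getsize(input_file)
    target = -(-file_size // num_parts)  # bytes per part before line alignment
    written = 0
    with open(input_file, "rb") as src:
        for part in range(1, num_parts + 1):
            if src.tell() >= file_size:
                break
            name = "%s%s%d" % (output_name, os.extsep, part)
            with open(name, "wb") as dst:
                remaining = target
                while remaining > 0:
                    block = src.read(min(bufsize, remaining))
                    if not block:
                        break
                    dst.write(block)
                    remaining -= len(block)
                # Finish the current line so no line spans two parts.
                if block and not block.endswith(b"\n"):
                    dst.write(src.readline())
            written += 1
    return written
```

Reading in binary mode avoids decoding the text at all, so it should also work on files whose encoding is unknown, as long as lines end in "\n".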
quamrana
+2  A: 

Linux has a split command:

split -l 100000 file.txt

would split the file into pieces of 100,000 lines each (the last piece gets whatever is left over).

+1  A: 

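Don't forget seek() and mmap() for random access to files.

import mmap

def getSomeChunk(filename, start, length):
    # Map the file read-only and return `length` bytes from offset `start`
    with open(filename, 'rb') as fobj:
        m = mmap.mmap(fobj.fileno(), 0, access=mmap.ACCESS_READ)
        return m[start:start + length]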
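
For the seek() half of that suggestion, a plain-file equivalent might look like the sketch below (read_range is a made-up name). It returns the same bytes as the mmap version without mapping the file.

```python
def read_range(filename, start, length):
    """Return up to `length` bytes of `filename` beginning at byte offset `start`."""
    with open(filename, "rb") as fobj:
        fobj.seek(start)          # jump straight to the offset, no reading in between
        return fobj.read(length)  # may be shorter if the file ends first
```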
Joe Koberg
A: 

This generator method is a (slow) way to get a slice of lines without blowing up your memory.

def slicefile(filename, start, end):
    # Yield lines whose zero-based index is in [start, end), one line at a time
    with open(filename) as f:
        for i, line in enumerate(f):
            if i >= end:
                return
            if start <= i:
                yield line

with open("/blah.txt", "w") as out:
    for line in slicefile("/python26/readme.txt", 10, 15):
        out.write(line)
Ryan Ginstrom