ansaurus

Question

How do I split a huge text file in Python?

Answer 1

+8 A:

Check out os.stat() for file size and file.readlines([sizehint]). Those two functions should be all you need for the reading part, and hopefully you know how to do the writing :)
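As a rough sketch of how those two calls fit together (the function name and the `num_parts` parameter are made up for illustration, and the file is opened in binary mode so that the size hint is measured in bytes):

```python
import os

def read_line_batches(path, num_parts):
    """Yield at most num_parts lists of whole lines of roughly equal byte size."""
    size = os.stat(path).st_size
    # Size hint per batch; the +1 keeps the hint positive for tiny files.
    hint = size // num_parts + 1
    with open(path, "rb") as f:
        while True:
            # readlines(hint) stops after the line that reaches the hint,
            # so a batch always ends on a line boundary.
            batch = f.readlines(hint)
            if not batch:
                break
            yield batch
```

Each batch can then be passed straight to `writelines()` on an output file. Because every batch runs slightly past the hint, the last one tends to be the smallest.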

Kamil Kisiel 2008-11-14 23:18:32

Thanks for the answer - your suggestions are working well so far for reading the file. When I've finished, I'll also try a binary version that doesn't read one line at a time.

quamrana 2008-11-15 20:04:04

What is wrong with `os.path.getsize(filename)`?

J.F. Sebastian 2008-11-16 18:02:57
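For the binary variant mentioned in the comments, one possible shape is sketched below. It is only an illustration: `split_binary`, `prefix` and `block_size` are invented names, the size comes from `os.path.getsize()`, and the parts are cut at byte offsets, so a line may be split across two output files.

```python
import os

def split_binary(path, num_parts, prefix, block_size=1024 * 1024):
    """Copy path into at most num_parts files of near-equal byte size."""
    size = os.path.getsize(path)
    part_size = -(-size // num_parts)  # ceiling division
    names = []
    with open(path, "rb") as src:
        for index in range(num_parts):
            remaining = part_size
            block = src.read(min(block_size, remaining))
            if not block:
                break  # input exhausted
            name = "%s%s%d" % (prefix, os.extsep, index + 1)
            with open(name, "wb") as dst:
                # Copy in blocks so only block_size bytes are held in memory.
                while block:
                    dst.write(block)
                    remaining -= len(block)
                    block = src.read(min(block_size, remaining))
            names.append(name)
    return names
```

Concatenating the parts in order gives back the original file byte for byte.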

Answer 2

+1 A:

You can use wc and split (see the respective manpages) to get the desired effect. In bash:

split -dl$((`wc -l 'filename'|sed 's/ .*$//'` / 3 + 1)) filename filename-chunk.

produces 3 parts of the same linecount (with a rounding error in the last, of course), named filename-chunk.00 to filename-chunk.02.

Svante 2008-11-15 00:11:38

Yes, it is not Python, but why use a screwdriver to apply a nail?

Svante 2008-11-16 01:05:56

Well, it's not really a screwdriver vs. nail... Python is often a great way to accomplish simple tasks such as this. And I don't want to bash bash (pun intended), but that is not really... readable :)

Agos 2010-02-04 23:22:53

It is very readable, you just need to know the language.

Svante 2010-02-05 21:28:50

Answer 3

A:

Or, a Python version of wc and split:

lines = 0
with open(filename) as f:
    for line in f:
        lines += 1

Then write some code to copy the first lines/3 lines into one file, the next lines/3 into another, and so on.
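One way that second step could look, as a sketch only: the function name and the `prefix` parameter are invented here, and the output names borrow the numeric-suffix pattern (`.00`, `.01`, ...) from the split answer above.

```python
def split_by_line_count(filename, num_parts, prefix):
    """Split filename into at most num_parts files of (nearly) equal line count."""
    # Pass 1: count lines without loading the file into memory.
    with open(filename, "rb") as f:
        total = sum(1 for _ in f)
    # Ceiling division, so num_parts files are always enough.
    per_part = -(-total // num_parts)

    names = []
    out = None
    try:
        # Pass 2: start a new output file every per_part lines.
        with open(filename, "rb") as f:
            for i, line in enumerate(f):
                if i % per_part == 0:
                    if out is not None:
                        out.close()
                    names.append("%s%02d" % (prefix, len(names)))
                    out = open(names[-1], "wb")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return names
```

Called as `split_by_line_count("filename", 3, "filename-chunk.")`, this should write up to three files, with the last one holding whatever lines are left over.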

Claudiu 2008-11-15 18:05:32

No need to keep the count manually; use enumerate: `for l, line in enumerate(open(filename)): ...`

Matthew Trevor 2008-11-16 08:55:36

Answer 4

A:

I've written the program and it seems to work fine, so thanks to Kamil Kisiel for getting me started.
(Note that FileSizeParts() is a function not shown here.)
Later I may get round to doing a version that does a binary read to see if it's any quicker.

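import os

def Split(inputFile, numParts, outputName):
    """Split inputFile into up to numParts files; return how many were written."""
    fileSize = os.stat(inputFile).st_size
    # FileSizeParts() is not shown; it presumably returns one size hint per part.
    parts = FileSizeParts(fileSize, numParts)
    outPart = 0
    with open(inputFile, 'r') as openInputFile:
        for part in parts:
            # readlines(part) returns whole lines totalling roughly `part` in size.
            lines = openInputFile.readlines(part)
            if not lines:
                break  # input exhausted; this replaces the tell() check
            outPart += 1
            fullOutputName = outputName + os.extsep + str(outPart)
            with open(fullOutputName, 'w') as openOutputFile:
                openOutputFile.writelines(lines)
    return outPart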
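Since FileSizeParts() is not shown, the following is purely a guess at what such a helper might do, based on how its result is used above: return one size per part, with the sizes summing to the file size. The real function may well differ.

```python
def FileSizeParts(fileSize, numParts):
    """Hypothetical stand-in: numParts sizes that sum to fileSize.

    The sizes differ by at most one; any remainder goes to the first parts.
    """
    base, extra = divmod(fileSize, numParts)
    return [base + 1 if i < extra else base for i in range(numParts)]
```

For example, a 10-byte file split three ways gives sizes of 4, 3 and 3. Because readlines() treats each size as a hint and always finishes the current line, the parts actually written will not match these sizes exactly.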

quamrana 2008-11-16 20:54:21

Answer 5

+2 A:

Linux has a split command:

split -l 100000 file.txt

This splits file.txt into pieces of 100,000 lines each (the last piece may be shorter), named xaa, xab, and so on by default.

2010-02-04 22:42:29

Answer 6

+1 A:

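Don't forget seek() and mmap() for random access to files.

import mmap

def getSomeChunk(filename, start, length):
    # Map the file read-only and return the bytes in [start, start + length).
    with open(filename, 'rb') as fobj:
        with mmap.mmap(fobj.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[start:start + length]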
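The seek() route mentioned above can be sketched in much the same way. The function name is invented for illustration, and the file is opened in binary mode so that offsets are plain byte positions.

```python
def read_chunk(filename, start, length):
    """Return up to length bytes beginning at byte offset start."""
    with open(filename, "rb") as f:
        f.seek(start)          # jump straight to the offset
        return f.read(length)  # shorter than length only near end of file
```

Unlike the mmap version, this never maps the whole file, which may matter for files larger than the available address space.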

Joe Koberg 2010-02-04 22:53:33

Answer 7

A:

This generator is a (slow) way to get a slice of lines without blowing up your memory: it still reads every line before `end`, but holds only one line at a time.

def slicefile(filename, start, end):
    # Yield the lines with 0-based index in [start, end).
    with open(filename) as f:
        for i, line in enumerate(f):
            if i >= end:
                return
            if start <= i:
                yield line

with open("/blah.txt", "w") as out:
    for line in slicefile("/python26/readme.txt", 10, 15):
        out.write(line)
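The same slice can probably be written more compactly with itertools.islice from the standard library. This is a sketch with an invented function name. It is no faster in principle, since the first `start` lines are still read and discarded.

```python
import itertools

def slice_lines(filename, start, end):
    """Yield the lines with 0-based index in [start, end)."""
    with open(filename) as f:
        # islice skips the first `start` lines and stops before `end`.
        for line in itertools.islice(f, start, end):
            yield line
```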

Ryan Ginstrom 2010-02-04 23:42:15
