ansaurus

Question

How do I load a big file and cut it into smaller files?

Answer 1

A:

I'm not going to get into the actual code, but pseudocode along these lines would do it:

BIGFILE = "filename"
n = 1
SMALLFILE = "smallfile" + n
while (line = readline(BIGFILE)) {
   write(SMALLFILE, line)
   if (line matches pattern) {
      n = n + 1
      SMALLFILE = "smallfile" + n
   }
}

This is very rough code, but hopefully you get the point. I should also have said that it doesn't matter how big your file is, since you have to read the whole file anyway.
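For what it's worth, here is a minimal Python sketch of the same idea. The function name split_on_pattern, the "smallfile" prefix, and the choice of a regular expression as the pattern are all assumptions made for illustration; the question doesn't say what the pattern is. As in the pseudocode, the matching line is written to the current small file and the next line starts a new one.

```python
import re


def split_on_pattern(big_path, pattern, prefix="smallfile"):
    """Stream big_path line by line; start a new output file after each
    line that matches pattern. Returns the list of files written.

    All names here are hypothetical, chosen only for this sketch.
    """
    regex = re.compile(pattern)
    count = 1
    written = []
    out = None
    try:
        with open(big_path, "rt") as big:
            for line in big:
                # Open the next small file lazily, so no empty file is
                # left behind when the last line of the input matches.
                if out is None:
                    name = "%s%d.txt" % (prefix, count)
                    out = open(name, "wt")
                    written.append(name)
                out.write(line)
                if regex.search(line):
                    # The matching line closes the current small file.
                    out.close()
                    out = None
                    count += 1
    finally:
        if out is not None:
            out.close()
    return written
```

Because only one line is held at a time, memory use stays flat no matter how big the input is.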

plor 2010-06-22 15:34:09

Answer 2

+1 A:

A 4MB file is very small; it will certainly fit in memory. The fastest approach would be to read it all in and then iterate over the lines, searching each one for the pattern and writing it out to the appropriate file depending on the pattern (your approach for small files).
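A rough sketch of that approach in Python might look like the following. The helper names (route_lines, write_groups), the substring-to-filename mapping, and the example file names are assumptions for illustration only, since the actual patterns aren't given in the question.

```python
from collections import defaultdict


def route_lines(lines, patterns, default=None):
    """Group lines by the first substring they contain.

    patterns maps a substring to an output filename (hypothetical
    convention for this sketch); dict order decides which substring
    wins when several match. Lines matching nothing go to default,
    or are dropped when default is None.
    """
    groups = defaultdict(list)
    for line in lines:
        for needle, filename in patterns.items():
            if needle in line:
                groups[filename].append(line)
                break
        else:
            if default is not None:
                groups[default].append(line)
    return groups


def write_groups(groups):
    """Write each group of lines to its file in one call."""
    for filename, group in groups.items():
        with open(filename, "wt") as out:
            out.writelines(group)


def split_file(path, patterns, default=None):
    """Read the whole file into memory, then route and write its lines."""
    with open(path, "rt") as f:
        lines = f.readlines()
    write_groups(route_lines(lines, patterns, default))
```

A call such as split_file('big.txt', {'ERROR': 'errors.txt', 'WARN': 'warnings.txt'}) would then send each line to the file for the first substring it contains.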

Vinko Vrsalovic 2010-06-22 15:34:32

Answer 3

+2 A:

I would first read all of the allegedly-big file into memory as a list of lines:

with open('socalledbig.txt', 'rt') as f:
    lines = f.readlines()

This should take little more than 4MB -- tiny even by the standard of today's phones, much less ordinary computers.

Then, perform whatever processing you need to determine the beginning and ending of each group of lines you want to write out to a smaller file (I'm not sure from your question's text whether such groups can overlap or leave gaps, so I'm offering the most general solution where they're fully allowed to -- this will also cover more constrained use cases, with no real performance penalty, though the code might be a tad simpler if the constraints were very rigid).

Say that you put these numbers in three lists: starts (index from 0 of the first line to write, included), ends (index from 0 of the first line to NOT write -- may legitimately and innocuously be len(lines) or more), and names (filenames to which you want to write). All three lists must have the same length, of course.

Then, lastly:

assert len(starts) == len(ends) == len(names)

for s, e, n in zip(starts, ends, names):
    with open(n, 'wt') as f:
        f.writelines(lines[s:e])

...and that's all you need to do!
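To make the slicing semantics concrete, here is a tiny in-memory illustration with made-up values (the line contents and file names are hypothetical). It shows a block that overlaps another, a line that falls in a gap, and an end index past the end of the list, which slicing simply clamps.

```python
lines = ["line0\n", "line1\n", "line2\n", "line3\n", "line4\n"]

starts = [0, 1, 4]
ends = [2, 3, 99]          # 99 is past the end; slicing clamps it
names = ["a.txt", "b.txt", "c.txt"]

assert len(starts) == len(ends) == len(names)

# Collect what each file would receive instead of writing to disk.
blocks = {n: lines[s:e] for s, e, n in zip(starts, ends, names)}

# blocks["a.txt"] is ["line0\n", "line1\n"]
# blocks["b.txt"] is ["line1\n", "line2\n"]  (line1 overlaps with a.txt)
# blocks["c.txt"] is ["line4\n"]             (line3 falls in a gap)
```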

Edit: the OP seems to be confused by the concept of having these lists, so let me try to give an example: each block written out to a file starts at a line containing 'begin' (included) and ends at the first immediately succeeding line containing 'end' (also included), and the names of the files to be written are to be result0.txt, result1.txt, and so on. It's an error if the number of "closing ends" differs from that of "opening begins" (and remember, the first immediately succeeding "end" terminates all pending "begins"); no line is allowed to contain both 'begin' and 'end'.

A very arbitrary set of conditions, to be sure, but then, the OP leaves us totally in the dark about the actual specifics of the problem, so what else can we do but guess most wildly?-)

outfile = 0
starts = []
ends = []
names = []
for i, line in enumerate(lines):
    if 'begin' in line:
        if 'end' in line:
            raise ValueError('Both begin and end: %r' % line)
        starts.append(i)
        names.append('result%d.txt' % outfile)
        outfile += 1
    elif 'end' in line:
        ends.append(i + 1)  # remember ends are EXCLUDED, hence the +1

That's it -- the assert about the three lists having identical lengths will take care of checking that the constraints are respected.

As the constraints and specs are changed, so of course will this snippet of code change accordingly -- as long as it fills the three equal-length lists starts, ends, and names, exactly how it does so matters not in the least to the rest of the code.
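As one illustration of how a different spec only changes the way the three lists are filled, here is a sketch for a purely hypothetical requirement: cut the file into blocks of a fixed number of lines. The function name fixed_size_blocks and the 'chunk%d.txt' naming scheme are assumptions, not anything taken from the question.

```python
def fixed_size_blocks(lines, chunk_size, template="chunk%d.txt"):
    """Return (starts, ends, names) for blocks of chunk_size lines each.

    Hypothetical spec used only to show that the writing loop stays
    the same as long as the three lists have equal length.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    starts = list(range(0, len(lines), chunk_size))
    # The last end may exceed len(lines); slicing clamps it harmlessly.
    ends = [s + chunk_size for s in starts]
    names = [template % i for i in range(len(starts))]
    return starts, ends, names
```

The lists it returns can be fed to the same zip-and-writelines loop unchanged.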

Alex Martelli 2010-06-22 16:15:04

There won't be any overlap in the groups to write, but there may be a one-line gap between them. Can you please explain more clearly what starts, ends and names are in the above code?

kaki 2010-06-22 16:24:12

@kaki, OK, I thought I was crystal clear but I'll add a totally bogus example (since we don't know the details at all) to try and help you further -- let me edit this.

Alex Martelli 2010-06-22 16:38:20

Thank you for the reply.

kaki 2010-06-22 17:26:52

@kaki, you're welcome.

Alex Martelli 2010-06-22 20:04:34
