I have a file of about 4MB (which I call the big one). It has about 160,000 lines in a specific format, and I need to cut it at regular (but not equal) intervals, i.e. at the end of a certain pattern, and write each part into another file.

Basically, I want to copy the information from the big file into many smaller files: as I read the big file, I keep writing into one file, and once a certain pattern occurs I close that file and start writing from that line into another file.

For a small file I guess this can be done with file.readline(): read each line, check whether the pattern ends there, write the line to the current file if not, and otherwise change the file name and open a new file, and so on. But how do I do it for this big file?

Thanks in advance.

I didn't mention the file format as I felt it isn't necessary; I will add it if required.

A: 

I'm not going to get into the actual code, but pseudo code would do this.

BIGFILE = "filename"
n = 1
SMALLFILE = "smallfile" + n
while (line = readline(BIGFILE)) {
   write(SMALLFILE, line)
   if (line matches pattern) {
      n = n + 1
      SMALLFILE = "smallfile" + n
   }
}

Which is really bad code, but maybe you get the point. I should also have said that it doesn't matter how big your file is since you have to read the file anyway.
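
For what it's worth, the pseudocode above might look roughly like this in Python. It is only a sketch: the function name split_file, the out_dir argument, the smallfile%d.txt naming and the use of a regular expression for the pattern are all assumptions, since the question doesn't say what the real pattern or file names are. It reads one line at a time, so memory use doesn't depend on the size of the big file.

```python
import os
import re


def split_file(big_path, pattern, out_dir, template="smallfile%d.txt"):
    """Copy big_path into numbered files, starting a new file after
    every line that matches pattern. Returns the paths written."""
    end_of_section = re.compile(pattern)  # hypothetical pattern
    written = []
    out = None
    try:
        with open(big_path, "rt") as big:
            for line in big:
                if out is None:
                    # Open the next small file only when there is a line for it.
                    path = os.path.join(out_dir, template % (len(written) + 1))
                    out = open(path, "wt")
                    written.append(path)
                out.write(line)
                if end_of_section.search(line):
                    # The matching line is the last line of this small file.
                    out.close()
                    out = None
    finally:
        if out is not None:
            out.close()
    return written
```

Calling split_file('big.txt', r'^END$', '.') would then write smallfile1.txt, smallfile2.txt and so on into the current directory, assuming each section ends with a line that reads END.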

plor
+1  A: 

A 4MB file is very small; it certainly fits in memory. The fastest approach would be to read it all and then iterate over each line searching for the pattern, writing each line out to the appropriate file depending on the pattern (your approach for small files).
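
A rough sketch of that in-memory approach, under assumptions: group_lines, write_groups, the is_end test and the part%d.txt names are all made up for illustration, because the real pattern isn't given in the question.

```python
def group_lines(lines, is_end):
    """Split a list of lines into groups; each group ends with a line
    for which is_end(line) is true (the last group may lack one)."""
    groups = [[]]
    for line in lines:
        groups[-1].append(line)
        if is_end(line):
            groups.append([])  # the next line starts a new group
    if not groups[-1]:
        groups.pop()  # drop the empty group left after a final match
    return groups


def write_groups(big_path, is_end, template="part%d.txt"):
    """Read the whole file, then write each group to its own file."""
    with open(big_path, "rt") as f:
        lines = f.readlines()  # the whole file, fine for a few MB
    for n, group in enumerate(group_lines(lines, is_end), start=1):
        with open(template % n, "wt") as out:
            out.writelines(group)
```

For example, write_groups('big.txt', lambda line: line.strip() == 'END') would treat every line reading END as the end of a part.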

Vinko Vrsalovic
+2  A: 

I would first read all of the allegedly-big file in memory as a list of lines:

with open('socalledbig.txt', 'rt') as f:
    lines = f.readlines()

This should take somewhat more than the file's 4MB once Python's per-string overhead is counted -- still tiny even by the standard of today's phones, let alone ordinary computers.

Then, perform whatever processing you need to determine the beginning and end of each group of lines you want to write out to a smaller file. (I can't tell from your question's text whether such groups can overlap or leave gaps, so I'm offering the most general solution, where both are fully allowed. This also covers more constrained use cases with no real performance penalty, though the code might be a tad simpler if the constraints were very rigid.)

Say that you put these numbers in three lists: starts (0-based index of the first line to write, included), ends (0-based index of the first line NOT to write -- may legitimately and innocuously be len(lines) or more), and names (filenames to which you want to write), all lists having the same length of course.

Then, lastly:

assert len(starts) == len(ends) == len(names)

for s, e, n in zip(starts, ends, names):
    with open(n, 'wt') as f:
        f.writelines(lines[s:e])

...and that's all you need to do!

Edit: the OP seems to be confused by the concept of having these lists, so let me try to give an example: each block written out to a file starts at a line containing 'begin' (included) and ends at the first immediately succeeding line containing 'end' (also included), and the names of the files to be written are to be result0.txt, result1.txt, and so on. It's an error if the number of "closing ends" differs from that of "opening begins" (and remember, the first immediately succeeding "end" terminates all pending "begins"); no line is allowed to contain both 'begin' and 'end'.

A very arbitrary set of conditions, to be sure, but then, the OP leaves us totally in the dark about the actual specifics of the problem, so what else can we do but guess most wildly?-)

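outfile = 0  # counter used to number the output files
starts = []  # index of the first line of each block (included)
ends = []    # index just past the last line of each block (excluded)
names = []   # output filename for each block
for i, line in enumerate(lines):
    if 'begin' in line:
        if 'end' in line:
            raise ValueError('Both begin and end: %r' % line)
        starts.append(i)
        names.append('result%d.txt' % outfile)
        outfile += 1
    elif 'end' in line:
        ends.append(i + 1)  # remember ends are EXCLUDED, hence the +1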

That's it -- the assert about the three lists having identical lengths will take care of checking that the constraints are respected.

As the constraints and specs are changed, so of course will this snippet of code change accordingly -- as long as it fills the three equal-length lists starts, ends, and names, exactly how it does so matters not in the least to the rest of the code.
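
To make the three lists concrete, here is the same bogus 'begin'/'end' example wrapped in a function and run on six made-up lines, with a one-line gap between the two blocks. The name find_blocks and the sample lines are purely illustrative; the real markers are still unknown.

```python
def find_blocks(lines, begin='begin', end='end', template='result%d.txt'):
    """Return (starts, ends, names) for blocks delimited by begin/end markers."""
    starts, ends, names = [], [], []
    for i, line in enumerate(lines):
        if begin in line:
            if end in line:
                raise ValueError('Both begin and end: %r' % line)
            starts.append(i)
            names.append(template % len(names))
        elif end in line:
            ends.append(i + 1)  # ends are excluded, hence the +1
    return starts, ends, names


lines = ['begin a\n', 'x\n', 'end a\n', 'gap\n', 'begin b\n', 'end b\n']
starts, ends, names = find_blocks(lines)
assert len(starts) == len(ends) == len(names)

# starts == [0, 4], ends == [3, 6], names == ['result0.txt', 'result1.txt']
blocks = [lines[s:e] for s, e in zip(starts, ends)]
# blocks[0] holds lines 0-2, blocks[1] holds lines 4-5; 'gap\n' is in neither.
```

Feeding starts, ends and names to the writing loop shown earlier would then produce result0.txt with three lines and result1.txt with two, skipping the gap line.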

Alex Martelli
There won't be overlap in the groups to write, but I may have to leave a one-line gap in between. Can you please be clearer about what starts, ends and names are in the above code?
kaki
@kaki, OK, I thought I was crystal clear but I'll add a totally bogus example (since we don't know the details at all) to try and help you further -- let me edit this.
Alex Martelli
Thank you for the reply.
kaki
@kaki, you're welcome.
Alex Martelli