I want to do this in Python but I'm stumped. I won't be able to load the whole file into RAM without things becoming unstable, so I want to read it line by line... Any advice would be appreciated.

+3  A: 

One idea could be the following:

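import itertools

with open('the1gfile.txt') as inf:
    # one output file per pass: outfile0.txt, outfile1.txt, ...
    for i in itertools.count():
        with open('outfile%d.txt' % i, 'w') as ouf:
            for linenum, line in enumerate(inf):
                ouf.write(line)
                if linenum == 99999: break  # 100,000 lines written; start the next file
            else:
                break  # input exhausted: leave the outer loop too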

The with statement requires Python 2.6 or better, or 2.5 with a from __future__ import with_statement at the top of the module. That's the reason I'm using old-fashioned % string formatting to make the output file names: the new style wouldn't work in 2.5, and you don't tell us what Python version you want to use. Substitute the new-style formatting if your Python version supports it, of course ;-).

itertools.count() yields 0, 1, 2, ... and so on, with no limit (that loop is terminated only when the conditional break at the very end finally executes).

for linenum, line in enumerate(inf): reads one line at a time (with some buffering for speed) and sets linenum to 0, 1, 2, ... and so on - and we break off that loop after 100,000 lines (next time, the for loop will continue reading exactly where this one left off).
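To see the "continues where this one left off" behaviour in isolation, here is a small sketch. It uses io.StringIO as a stand-in for the open input file (an assumption made only to keep the example self-contained); a real file object iterates the same way.

```python
import io

# Stand-in for an open text file: iterating yields one line at a time.
inf = io.StringIO("a\nb\nc\nd\ne\n")

first = []
for linenum, line in enumerate(inf):
    first.append(line)
    if linenum == 1:  # stop after two lines
        break

# A fresh enumerate() restarts the count at 0, but the underlying
# iterator resumes at the third line rather than at the start.
rest = list(enumerate(inf))
```

Only the counter resets between passes; the read position in the input does not.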

The for loop's else: clause executes if and only if the break within that loop didn't, that is, if we've read fewer than 100,000 lines, which happens when the input file is finished. Note that there will be one empty output file if the number of lines in the input file is an exact multiple of 100,000.
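If that trailing empty file is a problem, one possible variation (a sketch, not the only way to do it) is to fetch the first line of each chunk before opening the output file, and to copy the rest with itertools.islice. The function name split_lines and its parameters are made up for this illustration, and it assumes Python 3.

```python
import itertools

def split_lines(in_path, out_pattern="outfile%d.txt", lines_per_file=100000):
    """Split in_path into numbered files of at most lines_per_file lines.

    Returns the list of output file names that were written.
    """
    written = []
    with open(in_path) as inf:
        for i in itertools.count():
            # Look for one more line before creating a file, so that
            # no empty output file is left behind at end of input.
            first = next(inf, None)
            if first is None:
                break
            name = out_pattern % i
            with open(name, "w") as ouf:
                ouf.write(first)
                # islice is lazy, so lines are still streamed one at a time.
                ouf.writelines(itertools.islice(inf, lines_per_file - 1))
            written.append(name)
    return written
```

Calling split_lines('the1gfile.txt') would produce outfile0.txt, outfile1.txt, and so on, with the last file holding whatever lines remain.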

I hope this makes every part of the mechanism sufficiently clear for you...?

Alex Martelli
+15  A: 

If you do absolutely need to split the file, why not just use the *nix split utility?

http://ss64.com/bash/split.html

split -l 100000 inputfile
Amber
+1 for the right tool for the job.
paxdiablo
Because reading a 1G file into memory all at once wouldn't make a *nix box 'unstable'. +1 anyway for highlighting the difference between a toy OS and a real OS.
aaronasterling
`split` doesn't read the entire file into memory at once - it's stream-based.
Amber
@Amber I was implying that the OP is using Windows. If the OP were using *nix, there would have been no reference to 'unstable'.
aaronasterling
If you are on Windows, you're _still_ better off using split: http://gnuwin32.sourceforge.net/packages/coreutils.htm
paxdiablo
Right tool for the job... But it's *not* Python (albeit callable in Python), and it's not portable.
Beau Martínez
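For what "callable in Python" might look like, here is a minimal sketch using the standard subprocess module. It assumes a split executable is on the PATH (as on most *nix systems, or via the coreutils port linked above) and Python 3.5+ for subprocess.run; the function name split_with_coreutils and the prefix argument are made up for this illustration.

```python
import subprocess

def split_with_coreutils(in_path, prefix, lines_per_file=100000):
    """Run the external split utility on in_path.

    Equivalent to the shell command: split -l 100000 inputfile prefix
    Output files are named prefix + 'aa', prefix + 'ab', and so on.
    Raises CalledProcessError if split exits with an error, and
    FileNotFoundError if no split executable is found.
    """
    subprocess.run(
        ["split", "-l", str(lines_per_file), in_path, prefix],
        check=True,
    )
```

This keeps the portability caveat raised in the last comment: the script only works where split is installed.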