Hello, I have a large XML file (40 GB) that I need to split into smaller chunks. I am working with limited disk space, so is there a way to delete lines from the original file as I write them to new files?

Thanks!

+1  A: 

I'm pretty sure there is, as I've even been able to edit/read the source files of scripts while they were running, but the biggest problem would be all the shifting required if you started from the beginning of the file. On the other hand, if you go through the file and record the starting position of every line, you can then copy lines out in reverse order of position; once that's done, you can go back to each new file, one at a time, and (if it's small enough) use readlines() to build a list, reverse the list, seek back to the beginning of the file, and overwrite the lines in their old order with the lines in their new one.

(You would truncate the file after reading each block of lines from the end by using the truncate() method, which, when called with no size argument, discards everything past the current file position; this assumes you're reading the file with one of the classes from the io package, or a subclass of one. You'd just have to make sure the current file position ends up at the beginning of the last line written to the new file, which, since you're working backwards, is the earliest of those lines in the original.)
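
As a rough, untested illustration of that idea, simplified so that each new file gets a whole contiguous block of lines at once (which avoids the reversing step entirely), something along these lines might work; the file name and block size are made up:

# Build a list of every line's starting byte offset (this list itself takes
# memory proportional to the number of lines in the file).
offsets = []
with open("large_file.xml", "rb") as src:
    while True:
        pos = src.tell()
        if not src.readline():
            break
        offsets.append(pos)

LINES_PER_CHUNK = 100000                 # made-up block size
chunk = 0                                # note: chunk_00 ends up being the tail of the original
with open("large_file.xml", "rb+") as src:
    while offsets:
        # start of the last block of lines still left in the file
        start = offsets[-LINES_PER_CHUNK] if len(offsets) > LINES_PER_CHUNK else offsets[0]
        src.seek(start)
        with open("chunk_%02d.xml" % chunk, "wb") as out:
            buf = src.read(1 << 20)      # stream the tail of the file out
            while buf:
                out.write(buf)
                buf = src.read(1 << 20)
        src.seek(start)
        src.truncate()                   # drop the copied block from the original
        del offsets[-LINES_PER_CHUNK:]
        chunk += 1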

EDIT: Based on your comment about having to make the splits at the proper closing tags, you'll probably also need some way of detecting those tags (perhaps using the peek method), possibly with a regular expression.

JAB
+2  A: 

If you're on Linux/Unix, why not use the split command like this guy does?

split --bytes=100m /input/file /output/dir/prefix

EDIT: then use csplit, which splits on a pattern rather than a byte count.

plastic chris
This would not work as I have an XML file. I would need each file to be split at the correct location (after a complete record with closing tags).
Maulin
@Maulin: ouch... makes for an interesting problem, though.
Jesse
A: 

If time is not a major factor (or wear and tear on your disk drive):

  1. Open handle to file
  2. Read up to the size of your partition / logical break point (due to the XML)
  3. Save the rest of your file to disk (not sure how Python handles this as far as directly overwriting the file or memory usage goes)
  4. Write the partition to disk
  5. goto 1

If Python does not give you this level of control, you may need to dive into C.
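
For what it's worth, here is a rough, untested Python sketch of what steps 2-4 might look like. The catch is step 3: unless you rewrite the remainder truly in place, you need enough free space for a temporary copy of it. File names and sizes below are made up:

import os

SRC = "large_file.xml"
PART_SIZE = 100 * 1024 * 1024      # target partition size, before extending to a break point
part_num = 0

while os.path.getsize(SRC) > 0:
    with open(SRC, "rb") as src:
        part = src.read(PART_SIZE)
        # ... step 2: extend 'part' forward to the next logical break point ...
        with open("part_%02d.xml" % part_num, "wb") as out:   # step 4: write the partition
            out.write(part)
        with open(SRC + ".tmp", "wb") as rest:                # step 3: save the rest to disk
            buf = src.read(1 << 20)
            while buf:
                rest.write(buf)
                buf = src.read(1 << 20)
    os.remove(SRC)
    os.rename(SRC + ".tmp", SRC)   # the temporary copy becomes the new source file
    part_num += 1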

Jesse
A: 

You could always parse the XML file and write out, say, every 10000 elements to their own file. Look at the Incremental Parsing section of this link: http://effbot.org/zone/element-iterparse.htm
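
Roughly, and assuming your record element is called Record, you want 10000 records per file, and a made-up <Records> wrapper for the output (note that with a default namespace the tags come back as '{namespace-uri}Record'), that could look like:

import xml.etree.ElementTree as ET

RECORD_TAG = "Record"     # placeholder: the element that makes up one record
PER_FILE = 10000          # how many records to put in each output file

context = ET.iterparse("large_file.xml", events=("start", "end"))
event, root = next(context)           # grab the root element so we can keep clearing it

out = None
count = 0
chunk = 0
for event, elem in context:
    if event != "end" or elem.tag != RECORD_TAG:
        continue
    if out is None:
        out = open("split_%d.xml" % chunk, "wb")
        out.write(b'<?xml version="1.0"?>\n<Records>\n')
    out.write(ET.tostring(elem))      # serialize just this record
    root.clear()                      # drop processed elements so memory stays bounded
    count += 1
    if count % PER_FILE == 0:
        out.write(b"</Records>\n")
        out.close()
        out = None
        chunk += 1

if out is not None:
    out.write(b"</Records>\n")
    out.close()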

Jared
+7  A: 

Say you want to split the file into N pieces, then simply start reading from the back of the file (more or less) and repeatedly call truncate:

Truncate the file's size. If the optional size argument is present, the file is truncated to (at most) that size. The size defaults to the current position. The current file position is not changed. ...

import os
import stat

BUF_SIZE = 4096
size = os.stat("large_file")[stat.ST_SIZE]
chunk_size = size / N 
# or simply set a fixed chunk size based on your free disk space
c = 0

in_ = open("large_file", "r+")

while size > 0:
    in_.seek(-min(size, chunk_size), 2)
    # now you have to find a safe place to split the file at somehow
    # just read forward until you found one
    ...
    old_pos = in_.tell()
    with open("small_chunk%2d" % (c, ), "w") as out:
        b = in_.read(BUF_SIZE)
        while len(b) > 0:
            out.write(b)
            b = in_.read(BUF_SIZE)
    in_.truncate(old_pos)
    size = old_pos
    c += 1

Be careful, as I didn't test any of this. You might need to call flush() after the truncate call, and I don't know how quickly the file system will actually free up the space.
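
For the "find a safe place to split" step left as ... above, one possibility (the closing-tag name here is just a placeholder) is to read forward line by line until you hit a complete record's closing tag, much like the script further down checks for its end tag:

def seek_past_closing_tag(f, closing_tag="</Record>\n"):
    """Advance f past the next line that consists only of the record's closing tag."""
    line = f.readline()
    while line and line != closing_tag:
        line = f.readline()
    return f.tell()    # now positioned just after the closing tag (or at EOF if none was found)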

Torsten Marek
Thanks for all the input. I'll try some of your suggestions tonight.
Maulin
Good luck with that:)
Torsten Marek
Nice detail. I don't do enough Python to be able to pull something like this off the top of my head.
NoMoreZealots
Is there some way to truncate the first x bytes from a file? truncate(100) will make the file be at most 100 bytes; how can I delete the first 100 bytes from the file?
Maulin
Thanks for all your help, guys. I just took the easy way out and had the script FTP each chunk to a server with more space as it was done. If I had more time, I would try Torsten's approach.
Maulin
The only way to delete bytes from the beginning of a file is either to rewrite it completely, or to move everything in place, i.e. read byte 100, write it to byte 0, read 101, write to 1, etc., and then truncate at the end. Since you have to do that over and over again, you end up with O(n^2).
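
For completeness, a small untested sketch of that in-place shift (the buffer size is arbitrary); because the whole remainder gets rewritten every time, repeatedly peeling chunks off the front this way is what makes it O(n^2):

def remove_leading_bytes(path, n, buf_size=1 << 20):
    # Slide everything after the first n bytes forward, then truncate.
    with open(path, "rb+") as f:
        read_pos, write_pos = n, 0
        while True:
            f.seek(read_pos)
            buf = f.read(buf_size)
            if not buf:
                break
            f.seek(write_pos)
            f.write(buf)
            read_pos += len(buf)
            write_pos += len(buf)
        f.truncate(write_pos)
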
Torsten Marek
A: 

It's time to buy a new hard drive!

You can make a backup before trying all the other answers, so you don't lose data :)

Denis Barmenkov
A: 

Here is my script...

import os
from ftplib import FTP

# make ftp connection
ftp = FTP('server')
ftp.login('user', 'pwd')
ftp.cwd('/dir')

f1 = open('large_file.xml', 'r')

size = 0
split = False
count = 0

for line in f1:
    if not split:
        # start a new chunk; every chunk after the first needs its own
        # XML declaration and opening root tag
        file = 'split_' + str(count) + '.xml'
        f2 = open(file, 'w')
        if count > 0:
            f2.write('<?xml version="1.0"?>\n')
            f2.write('<StartTag xmlns="http://www.blah/1.2.0">\n')
        size = 0
        count += 1
        split = True
    if size < 1073741824:
        # keep copying lines until the chunk reaches ~1 GB
        f2.write(line)
        size += len(line)
    elif str(line) == '</EndTag>\n':
        # past 1 GB and at the end of a record: close out this chunk,
        # upload it, and delete the local copy to free up space
        f2.write(line)
        f2.write('</EndEndTag>\n')
        print('completed file %s' % str(count))
        f2.close()
        f2 = open(file, 'rb')
        print("ftp'ing file...")
        ftp.storbinary('STOR ' + file, f2)
        print('ftp done.')
        split = False
        f2.close()
        os.remove(file)
    else:
        f2.write(line)
        size += len(line)
Maulin