Hello, I have a large xml file (40 Gb) that I need to split into smaller chunks. I am working with limited space, so is there a way to delete lines from the original file as I write them to new files?
Thanks!
Hello, I have a large xml file (40 Gb) that I need to split into smaller chunks. I am working with limited space, so is there a way to delete lines from the original file as I write them to new files?
Thanks!
I'm pretty sure there is, as I've even been able to edit/read from the source files of scripts I've run, but the biggest problem would probably be all the shifting that would be done if you started at the beginning of the file. On the other hand, if you go through the file and record all the starting positions of the lines, you could then go in reverse order of position to copy the lines out; once that's done, you could go back, take the new files, one at a time, and (if they're small enough), use readlines() to generate a list, reverse the order of the list, then seek to the beginning of the file and overwrite the lines in their old order with the lines in their new one.
(You would truncate the file after reading the first block of lines from the end by using the truncate()
method, which truncates all data past the current file position if used without any arguments besides that of the file object, assuming you're using one of the classes or a subclass of one of the classes from the io
package to read your file. You'd just have to make sure that the current file position ends up at the beginning of the last line to be written to a new file.)
EDIT: Based on your comment about having to make the separations at the proper closing tags, you'll probably also have to develop an algorithm to detect such tags (perhaps using the peek
method), possibly using a regular expression.
If you're on Linux/Unix, why not use the split command like this guy does?
split --bytes=100m /input/file /output/dir/prefix
EDIT: then use csplit.
If time is not a major factor (or wear and tear on your disk drive):
If Python does not give you this level of control, you may need to dive into C.
You could always parse the XML file and write out say every 10000 elements to there own file. Look at the Incremental Parsing section of this link. http://effbot.org/zone/element-iterparse.htm
Say you want to split the file into N pieces, then simply start reading from the back of the file (more or less) and repeatedly call truncate:
Truncate the file's size. If the optional size argument is present, the file is truncated to (at most) that size. The size defaults to the current position. The current file position is not changed. ...
import os
import stat
BUF_SIZE = 4096
size = os.stat("large_file")[stat.ST_SIZE]
chunk_size = size / N
# or simply set a fixed chunk size based on your free disk space
c = 0
in_ = open("large_file", "r+")
while size > 0:
in_.seek(-min(size, chunk_size), 2)
# now you have to find a safe place to split the file at somehow
# just read forward until you found one
...
old_pos = in_.tell()
with open("small_chunk%2d" % (c, ), "w") as out:
b = in_.read(BUF_SIZE)
while len(b) > 0:
out.write(b)
b = in_.read(BUF_SIZE)
in_.truncate(old_pos)
size = old_pos
c += 1
Be careful, as I didn't test any of this. It might be needed to call flush
after the truncate call, and I don't know how fast the file system is going to actually free up the space.
Its a time to buy a new hard drive!
You can make backup before trying all other answers and don't get data lost :)
Here is my script...
import string
import os
from ftplib import FTP
# make ftp connection
ftp = FTP('server')
ftp.login('user', 'pwd')
ftp.cwd('/dir')
f1 = open('large_file.xml', 'r')
size = 0
split = False
count = 0
for line in f1:
if not split:
file = 'split_'+str(count)+'.xml'
f2 = open(file, 'w')
if count > 0:
f2.write('<?xml version="1.0"?>\n')
f2.write('<StartTag xmlns="http://www.blah/1.2.0">\n')
size = 0
count += 1
split = True
if size < 1073741824:
f2.write(line)
size += len(line)
elif str(line) == '</EndTag>\n':
f2.write(line)
f2.write('</EndEndTag>\n')
print('completed file %s' %str(count))
f2.close()
f2 = open(file, 'r')
print("ftp'ing file...")
ftp.storbinary('STOR ' + file, f2)
print('ftp done.')
split = False
f2.close()
os.remove(file)
else:
f2.write(line)
size += len(line)