So I have these giant XML files (and by giant, I mean like 1.5GB+) and they don't have CRLFs. I'm trying to run a diff-like program to find the differences between these files.
Since I've yet to find a diff program that won't explode due to memory exhaustion, I've decided the best bet was to add CRLFs after closing tags.
I wrote a python script to read char-by-char and add new-lines after '>'. The problem is I'm running this on a single core PC circa-1995 or something ridiculous, and it's only processing about 20MB/hour when I have both converting at the same time.
Any idea if writing this in C#/C/C++ instead will yield any benefits? If not, does anyone know of a diff program that will go byte-by-byte? Thanks.
EDIT:
Here's the code for my processing function...
def read_and_format(inputfile, outputfile):
''' Open input and output files, then read char-by-char and add new lines after ">" '''
infile = codecs.open(inputfile,"r","utf-8")
outfile = codecs.open(outputfile,"w","utf-8")
char = infile.read(1)
while(1):
if char == "":
break
else:
outfile.write(char)
if(char == ">"):
outfile.write("\n")
char = infile.read(1)
infile.close()
outfile.close()
EDIT2: Thanks for the awesome responses. Increaseing the read size created an unbelievable speed increase. Problem solved.