Hi all,

We have a process which takes a very large CSV (1.6GB) and breaks it down into pieces (in this case 3). This runs nightly and normally doesn't give us any problems. When it ran last night, however, the first of the output files had lost precision on the numeric fields in the data. The active part of the script is these lines:

while lineCounter <= chunk:
    oOutFile.write(oInFile.readline())
    lineCounter = lineCounter + 1

and the normal output might be something like

StringField1; StringField2; StringField3; StringField4; 1000000; StringField5; 0.000054454

etc.

On this one occasion, and in this one output file only, the numeric fields were all written out with exactly six decimal places, i.e.

StringField1; StringField2; StringField3; StringField4; 1000000.000000; StringField5; 0.000000

We are using Python v2.6 (and don't want to upgrade unless we really have to) but we can't afford to lose this data. Does anyone have any idea why this might have happened? If the readline is doing some kind of implicit conversion is there a way to do a binary read, because we really just want this data to pass through untouched?

It is very weird to us that this only affected one of the output files generated by the same script, and that when it was rerun the output was as expected.

thanks

Jack

(line-counting method referenced in the comments below)

def count_lines(filename):  # function name is illustrative; only the body was posted
    f = open(filename)
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    return lines
+1  A: 

.readline() doesn't do anything with the content of the line, certainly not with numbers, so it's definitely not the culprit.
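On the binary read asked about in the question: opening both files in binary mode ('rb' / 'wb') rules out any newline or encoding translation, so every line is copied byte for byte. Here is a minimal sketch of such a split, not the original script. The function name `split_by_lines`, its parameters, and the rule that the last output file receives all remaining lines are assumptions made for illustration.

```python
def split_by_lines(in_path, out_paths, lines_per_chunk):
    """Copy in_path into the files in out_paths, byte for byte.

    Every output except the last gets lines_per_chunk lines; the last
    one takes whatever remains. All names here are illustrative.
    """
    with open(in_path, 'rb') as in_file:  # 'rb': no newline translation
        for index, out_path in enumerate(out_paths):
            is_last = index == len(out_paths) - 1
            with open(out_path, 'wb') as out_file:
                written = 0
                while is_last or written < lines_per_chunk:
                    line = in_file.readline()
                    if not line:  # empty result means end of file
                        break
                    out_file.write(line)  # written exactly as read
                    written += 1
```

Nothing in this loop parses or formats the fields, so a value such as 0.000054454 cannot come out as 0.000000 from this code alone.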

Thanks for giving more info, but this still looks very mysterious to me as neither function should be causing such a change. You didn't open the output in Excel, by any chance? Sometimes Excel does weird things and interprets stuff in an unexpected way. Grasping at straws here...

(As an aside, I don't see the big optimization potential in read_f = f.read :))
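For what it's worth, the line counter from the question could be written without the alias and with a `with` block, which closes the file handle even if a read fails. This is only a sketch. The name `count_lines` and the `buf_size` parameter are chosen here for illustration, and binary mode is an assumption so that the bytes are counted exactly as stored.

```python
def count_lines(filename, buf_size=1024 * 1024):
    """Count newline bytes in a file, reading buf_size bytes at a time."""
    lines = 0
    with open(filename, 'rb') as f:  # handle is closed when the block exits
        buf = f.read(buf_size)
        while buf:
            lines += buf.count(b'\n')
            buf = f.read(buf_size)
    return lines
```

With 1 MB reads, a 1.6GB file needs only around 1,600 calls to `f.read`, so the attribute lookup saved by `read_f = f.read` is unlikely to be measurable.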

Tim Pietzcker
Thanks Tim, oInFile is the 1.6GB file, which lives on a network drive. I didn't mention that the only other thing the script does is count the lines in the infile using this method: f = open(filename) lines = 0 buf_size = 1024 * 1024 read_f = f.read # loop optimization buf = read_f(buf_size) while buf: lines += buf.count('\n') buf = read_f(buf_size) return lines. I've noticed it doesn't explicitly close the file handle here; surely that can't be the problem?
That isn't formatting properly; I'll try putting it in the original question.
No, the output gets picked up by another Python script in what should be a read-only buffer, although that isn't something I've checked and it is worth looking at. Thanks for the ideas; I think I'm going to have to put something in just to check whether this happens on future runs :(
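One possible form for such a check, as a sketch only: hash the input file and hash the output files read back to back, then compare the two digests. This assumes the outputs are a pure split of the input, with no headers added or lines dropped. The function names `file_digest` and `chunks_match_source` are hypothetical, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib

def file_digest(paths, buf_size=1024 * 1024):
    """SHA-256 hex digest of the files in paths, read in order as one stream."""
    digest = hashlib.sha256()
    for path in paths:
        with open(path, 'rb') as f:  # binary mode: hash the bytes as stored
            buf = f.read(buf_size)
            while buf:
                digest.update(buf)
                buf = f.read(buf_size)
    return digest.hexdigest()

def chunks_match_source(in_path, out_paths):
    """True if the outputs, concatenated in order, equal the input exactly."""
    return file_digest([in_path]) == file_digest(out_paths)
```

Running `chunks_match_source` at the end of the nightly job and logging a failure would flag a run like this one. It costs one extra read of the 1.6GB input plus one read of the outputs.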