Well, this could be a simple question; to be frank, I'm a little confused by encodings and all those things.

Let's suppose I have the file 01234.txt which is iso-8859-1.

When I do:

iconv --from-code=iso-8859-1 --to-code=utf-8 01234.txt > 01234_utf8.txt

It gives me the desired result, but when I do the same thing from Python using subprocess:

import subprocess

p0 = subprocess.Popen([<here the same command>], shell=True)
p0.wait()

I get almost the same result, but the new file is truncated: it is missing part of the second-to-last line and the whole last line.

Here are the last three lines of both files.

iconv result:

795719000|MARIA TERESA MARROU VILLALOBOS|107
259871385|CHRISTIAM ALBERTO SUAREZ VILLALOBOS|107
311015100|JORGE MEZA CERVANTES|09499386

Python result:

795719000|MARIA TERESA MARROU VILLALOBOS|107
259871385|CHRISTIAM

EDIT: In the python file I've tried using coding: utf-8 and coding: iso-8859-1 (not both at the same time).

EDIT: I've used codecs in bpython and it works great. When I use it from a file, I don't get the desired result.

EDIT: I'm using Linux (Ubuntu 9.10) and Python 2.6.2.

Any suggestions?

+1  A: 

You wrote: "In the python file I've used coding: utf-8 and coding: iso-8859-1."

Firstly, only the first of those will be used. Secondly, that declaration specifies the encoding of the Python source file in which it appears, so that the Python compiler can do its job. Consequently, it has absolutely nothing to do with the encodings of your input and output files. A script to transcode data from encoding X to encoding Y can be written using only ASCII characters.
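To illustrate that last point, here is a minimal sketch of such a transcoding script, written in pure ASCII with no coding declaration at all. The function name transcode and its default arguments are made up for this example, and it assumes the input really is valid in the source encoding. It uses io.open, which should be available from Python 2.6 onwards.

```python
import io


def transcode(src_path, dst_path, from_code="iso-8859-1", to_code="utf-8"):
    """Copy src_path to dst_path, decoding from from_code and encoding to to_code.

    newline="" turns off newline translation so line endings are preserved.
    """
    with io.open(src_path, "r", encoding=from_code, newline="") as src:
        with io.open(dst_path, "w", encoding=to_code, newline="") as dst:
            for line in src:
                dst.write(line)
    # Both files are flushed and closed when the with blocks exit.
```

Calling transcode("01234.txt", "01234_utf8.txt") would then do roughly what the iconv command in the question does, without involving subprocess at all.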

Now to your problem:

You wrote: "p0 = subprocess.Popen([<here the same command>], shell=True)"

Please (always) when asking a question, show the EXACT code that was run, not what you hoped/thought was run. Use copy/paste, don't retype it. Don't try to put it in a comment; edit your question.

Update: Here is a GUESS, based on the symptoms: you are losing the last few bytes of a file -- looks like failure to flush a buffer before fading away. Is the size of the truncated output file an integral power of 2?
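A quick way to answer that size question is sketched below. The helper name size_is_power_of_two is invented for this example; it only inspects the file size and proves nothing by itself about where the truncation comes from.

```python
import os


def size_is_power_of_two(path):
    """Return (size, flag), where flag is True if size is an exact power of 2."""
    size = os.path.getsize(path)
    # A positive integer is a power of 2 exactly when it has a single bit set.
    return size, size > 0 and (size & (size - 1)) == 0
```

For example, size_is_power_of_two("01234_utf8.txt") on the truncated file would report its size in bytes along with the answer.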

Perhaps you should not rely on the command line processor doing > 01234_utf8.txt reliably. If you omit that part of the command, does the full payload appear on stdout? If so, you may be able to work around the problem by opening the output file yourself, passing its handle as the stdout arg, and later doing handle.flush() and handle.close().
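A minimal sketch of that workaround follows. Since the exact command you ran is not shown, this is an assumption about how it could look, not a fix for your actual code: the helper name run_to_file is made up, the command is passed as an argument list without shell=True, and the iconv arguments in the comment are copied from the command line in the question.

```python
import subprocess


def run_to_file(cmd, out_path):
    """Run cmd (a list of arguments, no shell) with its stdout sent to out_path.

    Returns the exit status of the child process.
    """
    handle = open(out_path, "wb")
    try:
        p0 = subprocess.Popen(cmd, stdout=handle)
        returncode = p0.wait()  # wait until the child has finished writing
        handle.flush()
    finally:
        handle.close()
    return returncode


# Hypothetical usage with the file names from the question:
# run_to_file(
#     ["iconv", "--from-code=iso-8859-1", "--to-code=utf-8", "01234.txt"],
#     "01234_utf8.txt")
```

Passing a list and leaving out shell=True means no shell is involved, so the > redirection is replaced entirely by the stdout argument. A non-zero return value would suggest that the command itself failed.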

John Machin
When I said I used both codings, I meant that I tried each one separately (e.g. the first time with utf8 and the second time with iso-8859-1), not both at the same time. My bad for not explaining it properly.
snahor
On to the real problem: Please also edit your question to say whether you are on Unix or Windows --- highly relevant when using subprocess.Popen()
John Machin