When I invoke a Python 3 script from a Windows batch (.cmd) file, a UTF-8 arg is not passed as UTF-8, but as a series of bytes, each of which Python interprets as a separate character.
How can I convert the Python 3 arg string to its intended UTF-8 state?
The calling .cmd and the called .py are shown below.
PS. As I mention in a comment below, calling u00FF.py "ÿ" directly from the Windows console command line works fine. It is only a problem when I invoke u00FF.py via the .cmd, and I am looking for a Python 3
way to convert the double-encoded UTF-8 arg back to its "normally" encoded UTF-8 form.
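Judging by the output further down, the UTF-8 bytes seem to be re-interpreted through the console's OEM codepage, so one workaround I am considering is simply reversing that step. This is only a sketch, and it assumes the console codepage really is 850 (437 maps these two bytes the same way):

import sys

# argv arrives as e.g. '├┐': the two UTF-8 bytes of 'ÿ' (0xC3 0xBF) were each
# decoded with the OEM console codepage instead of as UTF-8.
raw = sys.argv[1]
fixed = raw.encode('cp850').decode('utf-8')   # assumed codepage; 'cp437' also works here
print(repr(fixed))                            # 'ÿ'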
I've now included the full (and latest) test code here. It's a bit long, but I hope it explains the issue clearly enough.
Update: I've seen why the file read of "ÿ" was "double-encoding"... I was reading the UTF-8 file without decoding it as UTF-8... I should have used codecs.open('u00FF.arg', 'r', 'utf-8')
instead of just plain open('u00FF.arg', 'r')
... I've updated the offending code and the output. The codepage issue seems to be the only problem now...
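For reference, the corrected read looks like this, either with codecs.open or with the built-in Python 3 open (both should behave the same for this file):

import codecs

# Decode the file explicitly as UTF-8 rather than with the locale default.
with codecs.open('u00FF.arg', 'r', 'utf-8') as f:
    line = f.readline()

# Equivalent using the built-in Python 3 open():
with open('u00FF.arg', 'r', encoding='utf-8') as f:
    line = f.readline()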
Because the Python issue has been largely resolved, and the codepage issue is quite independent of Python, I have posted a separate codepage-specific question:
Codepage 850 works, 65001 fails! There is NO response to “call foo.cmd”. internal commands work fine.
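To see which codepage is actually in effect while the .cmd is running, it can also be queried from inside the script; this is just a diagnostic sketch, and the ctypes call is Windows-only:

import sys
import ctypes

# The encoding Python chose for the console, and the console's own output
# codepage as reported by the Win32 API.
print("sys.stdout.encoding:", sys.stdout.encoding)
print("GetConsoleOutputCP():", ctypes.windll.kernel32.GetConsoleOutputCP())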
::::::::::::::::::: BEGIN .cmd BATCH FILE ::::::::::::::::::::
:: Windows Batch file (UTF-8 encoded, no BOM): "u00FF.cmd"
@echo ÿ>u00FF.arg
@u00FF.py "ÿ"
@goto :eof
::::::::::::::::::: END OF .cmd BATCH FILE ::::::::::::::::::::
################### BEGIN .py SCRIPT #####################################
# -*- coding: utf-8 -*-
import sys
print ("""
Unicode
=======
CodePoint U+00FF
Character ÿ __Unicode Character 'LATIN SMALL LETTER Y WITH DIAERESIS'
UTF-8 bytes
===========
Hex: \\xC3 \\xBF
Dec: 195 191
Char: Ã ¿ __Unicode Character 'INVERTED QUESTION MARK'
\_______Unicode Character 'LATIN CAPITAL LETTER A WITH TILDE'
""")
print("## ====================================================")
print("## ÿ via hard-coding in this .py script itself ========")
print("##")
hard1s = "ÿ"
hard1b = hard1s.encode('utf_8')
print("hard1s: len", len(hard1s), " '" + hard1s + "'")
print("hard1b: len", len(hard1b), hard1b)
for i in range(0, len(hard1s)):
    print("CodePoint[", i, "]", hard1s[i], "U+{0:04X}".format(ord(hard1s[i])))
print(''' This is a single CodePoint for "ÿ" (as expected).''')
print()
print("## ====================================================")
print("## ÿ read into this .py script from a UTF-8 file ======")
print("##")
import codecs
file1 = codecs.open( 'u00FF.arg', 'r', 'utf-8' )
file1s = file1.readline()
file1s = file1s[:1] # keep only the first character (drop the trailing newline)
file1b = file1s.encode('utf_8')
print("file1s: len", len(file1s), " '" + file1s + "'")
print("file1b: len", len(file1b), file1b)
for i in range(0, len(file1s)):
    print("CodePoint[", i, "]", file1s[i], "U+{0:04X}".format(ord(file1s[i])))
print(''' This is a single CodePoint for "ÿ" (as expected).''')
print()
print("## ====================================================")
print("## ÿ via sys.argv from a call to .py from a .cmd) ===")
print("##")
argv1s = sys.argv[1]
argv1b = argv1s.encode('utf_8')
print("argv1s: len", len(argv1s), " '" + argv1s + "'")
print("argv1b: len", len(argv1b), argv1b)
for i in range(0, len(argv1s)):
    print("CodePoint[", i, "]", argv1s[i], "U+{0:04X}".format(ord(argv1s[i])))
print(''' These 2 CodePoints are way off-beam,
even allowing for the "double-encoding" seen above.
The CodePoints are from an entirely different Unicode-Block.
This must be a Codepage issue.''')
print()
################### END OF .py SCRIPT #####################################
Here is the output from the above code.
========================== BEGIN OUTPUT ================================
C:\>u00FF.cmd
Unicode
=======
CodePoint U+00FF
Character ÿ __Unicode Character 'LATIN SMALL LETTER Y WITH DIAERESIS'
UTF-8 bytes
===========
Hex: \xC3 \xBF
Dec: 195 191
Char: Ã ¿ __Unicode Character 'INVERTED QUESTION MARK'
\_______Unicode Character 'LATIN CAPITAL LETTER A WITH TILDE'
## ====================================================
## ÿ via hard-coding in this .py script itself ========
##
hard1s: len 1 'ÿ'
hard1b: len 2 b'\xc3\xbf'
CodePoint[ 0 ] ÿ U+00FF
This is a single CodePoint for "ÿ" (as expected).
## ====================================================
## ÿ read into this .py script from a UTF-8 file ======
##
file1s: len 1 'ÿ'
file1b: len 2 b'\xc3\xbf'
CodePoint[ 0 ] ÿ U+00FF
This is a single CodePoint for "ÿ" (as expected).
## ====================================================
## ÿ via sys.argv (from a call to .py from a .cmd) ===
##
argv1s: len 2 '├┐'
argv1b: len 6 b'\xe2\x94\x9c\xe2\x94\x90'
CodePoint[ 0 ] ├ U+251C
CodePoint[ 1 ] ┐ U+2510
These 2 CodePoints are way off-beam,
even allowing for the "double-encoding" seen above.
The CodePoints are from an entirely different Unicode-Block.
This must be a Codepage issue.
========================== END OF OUTPUT ================================
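The two stray code points do line up with an OEM-codepage reading of the UTF-8 bytes, which is why I think this is a codepage issue. A quick check (assuming codepage 850; 437 maps these two bytes identically):

# 'ÿ' encoded as UTF-8 is b'\xc3\xbf'; decoded with cp850, those two bytes
# become the box-drawing characters U+251C and U+2510 seen in argv above.
utf8_bytes = 'ÿ'.encode('utf-8')
print(utf8_bytes)                                           # b'\xc3\xbf'
print(utf8_bytes.decode('cp850'))                           # ├┐
print(['U+{0:04X}'.format(ord(c)) for c in utf8_bytes.decode('cp850')])  # ['U+251C', 'U+2510']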