When I invoke a Python 3 script from a Windows batch (.cmd) file, a UTF-8 arg is not passed as "UTF-8", but as a series of bytes, each of which Python interprets as an individual character.

How can I convert the Python 3 arg string to its intended UTF-8 state?
The calling .cmd and the called .py are shown below.

PS. As I mention in a comment below, calling u00FF.py "ÿ" directly from the Windows console command line works fine. It is only a problem when I invoke u00FF.py via the .cmd, and I am looking for a Python 3 way to convert the double-encoded UTF-8 arg back to a "normally" encoded form.
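(For reference, here is the kind of round-trip I mean. It is only a sketch, and it assumes the console codepage is 850, which is what the output further below suggests.)

    # Sketch only: assumes the argument's UTF-8 bytes were mis-decoded
    # under codepage 850 (hence the '├┐' seen in the output below).
    # Re-encoding to cp850 recovers the raw bytes; decoding those bytes
    # as UTF-8 then yields the intended character.
    def undo_mojibake(s):
        return s.encode('cp850').decode('utf-8')

    # e.g. undo_mojibake('├┐') == 'ÿ'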

I've now included the full (and latest) test code here. It's a bit long, but I hope it explains the issue clearly enough.

Update: I've seen why the file read of "ÿ" was "double-encoding"... I was reading the UTF-8 file without telling Python its encoding, so each byte was decoded as a separate character... I should have used codecs.open('u00FF.arg', 'r', 'utf-8') instead of plain open('u00FF.arg', 'r')... I've updated the offending code and the output. The codepage issue seems to be the only problem now...
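(Side note: in Python 3 the built-in open() also accepts an encoding argument, so codecs.open is not strictly needed; the equivalent is:)

    # Same effect as codecs.open('u00FF.arg', 'r', 'utf-8'):
    with open('u00FF.arg', 'r', encoding='utf-8') as file1:
        file1s = file1.readline().rstrip('\r\n')  # drop the line ending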

Because the Python issue has been largely resolved, and the codepage issue is quite independent of Python, I have posted another codepage-specific question:
Codepage 850 works, 65001 fails! There is NO response to “call foo.cmd”; internal commands work fine.

:::::::::::::::::::   BEGIN .cmd BATCH FILE ::::::::::::::::::::
:: Windows Batch file (UTF-8 encoded, no BOM): "u00FF.cmd" 
   @echo ÿ>u00FF.arg
   @u00FF.py "ÿ"  
   @goto :eof  
:::::::::::::::::::   END OF .cmd BATCH FILE ::::::::::::::::::::

################### BEGIN .py SCRIPT #####################################  
    # -*- coding: utf-8 -*-

    import sys
    print ("""
    Unicode
    =======
        CodePoint U+00FF
        Character ÿ __Unicode Character 'LATIN SMALL LETTER Y WITH DIAERESIS'

    UTF-8 bytes
    ===========
        Hex: \\xC3 \\xBF
        Dec:  195  191
        Char:   Ã    ¿ __Unicode Character 'INVERTED QUESTION MARK'
                 \_______Unicode Character 'LATIN CAPITAL LETTER A WITH TILDE' 

    """)
    print("## ====================================================")
    print("## ÿ via hard-coding in this .py script itself ========")
    print("##")
    hard1s = "ÿ"
    hard1b = hard1s.encode('utf_8')
    print("hard1s: len", len(hard1s), " '" + hard1s + "'")
    print("hard1b: len", len(hard1b),        hard1b)
    for i in range(len(hard1s)):
        print("CodePoint[", i, "]", hard1s[i], "U+{0:04X}".format(ord(hard1s[i])))
    print('''         This is a single CodePoint for "ÿ" (as expected).''')
    print()
    print("## ====================================================")
    print("## ÿ read into this .py script from a UTF-8 file ======")
    print("##")
    import codecs
    file1 = codecs.open( 'u00FF.arg', 'r', 'utf-8' )
    file1s = file1.readline()
    file1s = file1s[:1] # keep only the first char (drop the line ending)
    file1b = file1s.encode('utf_8')
    print("file1s: len", len(file1s), " '" + file1s + "'")
    print("file1b: len", len(file1b),        file1b)
    for i in range(len(file1s)):
        print("CodePoint[", i, "]", file1s[i], "U+{0:04X}".format(ord(file1s[i])))
    print('''         This is a single CodePoint for "ÿ" (as expected).''')
    print()
    print("## ====================================================")
    print("## ÿ via sys.argv  from a call to .py  from a .cmd) ===")
    print("##")
    argv1s = sys.argv[1]
    argv1b = argv1s.encode('utf_8')
    print("argv1s: len", len(argv1s), " '" + argv1s + "'")
    print("argv1b: len", len(argv1b),        argv1b)
    for i in range(len(argv1s)):
        print("CodePoint[", i, "]", argv1s[i], "U+{0:04X}".format(ord(argv1s[i])))
    print('''         These 2 CodePoints are way off-beam,
                     even allowing for the "double-encoding" seen above.
                     The CodePoints are from an entirely different Unicode-Block.
                     This must be a Codepage issue.''')
    print()
################### END OF .py SCRIPT #####################################  

Here is the output from the above code.

========================== BEGIN OUTPUT ================================
    C:\>u00FF.cmd
    Unicode
    =======
        CodePoint U+00FF
        Character ÿ __Unicode Character 'LATIN SMALL LETTER Y WITH DIAERESIS'

    UTF-8 bytes
    ===========
        Hex: \xC3 \xBF
        Dec:  195  191
        Char:   Ã    ¿ __Unicode Character 'INVERTED QUESTION MARK'
                 \_______Unicode Character 'LATIN CAPITAL LETTER A WITH TILDE'


    ## ====================================================
    ## ÿ via hard-coding in this .py script itself ========
    ##
    hard1s: len 1  'ÿ'
    hard1b: len 2 b'\xc3\xbf'
    CodePoint[ 0 ] ÿ U+00FF
                     This is a single CodePoint for "ÿ" (as expected).

    ## ====================================================
    ## ÿ read into this .py script from a UTF-8 file ======
    ##
    file1s: len 1  'ÿ'
    file1b: len 2 b'\xc3\xbf'
    CodePoint[ 0 ] ÿ U+00FF
                     This is a single CodePoint for "ÿ" (as expected).

    ## ====================================================
    ## ÿ via sys.argv (from a call to .py from a .cmd) ===
    ##
    argv1s: len 2  '├┐'
    argv1b: len 6 b'\xe2\x94\x9c\xe2\x94\x90'
    CodePoint[ 0 ] ├ U+251C
    CodePoint[ 1 ] ┐ U+2510
                     These 2 CodePoints are way off-beam,
                     even allowing for the "double-encoding" seen above.
                     The CodePoints are from an entirely different Unicode-Block.
                     This must be a Codepage issue.
========================== END OF OUTPUT ================================
+1  A: 

The Windows shell uses a specific code page (see the CHCP command's output). You need to convert from the Windows code page to UTF-8; see iconv, or Python's decode()/encode() methods.
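Something along these lines, for example (an untested sketch, Windows-only; it assumes the argument really was mis-decoded under the console codepage):

    import ctypes, sys

    # Ask Windows for the console's input codepage (e.g. 850), then
    # reverse the mis-decoding of the first argument.
    cp = 'cp%d' % ctypes.windll.kernel32.GetConsoleCP()
    fixed = sys.argv[1].encode(cp).decode('utf-8')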

RC
@RC: I don't think it is the fault of the command shell, as calling u00FF.py "ÿ" directly from the Windows console command line works fine. It seems to be an issue with the batch interpreter. I've bemoaned this issue in a bit more detail on [SuperUser][1]. [1]: http://superuser.com/questions/170447/using-windows-cmd-exe-and-a-batch-cmd-can-i-copy-xcopy-a-unicode-filename-f iconv.exe converts between encodings (and is actually how I created the UTF-8 from a DIR command in the first place), but encode() seemed to compound the problem, and decode() has nothing to decode; it is already UTF-8.
fred.bear
@RC: There are two things happening... I have a codepage issue, as you said, and I am getting a double-encoding of the UTF-8... When I write "ÿ" as UTF-8 to a file and read it in the .py script, the correct bytes are there, but they are quite different to the argv values, which have been passed as bytes that Python then reads as one Unicode codepoint per byte (thereby double UTF-8 encoding)... I'll keep poking around in the codepage "weirdness"... maybe one day everything will be UTF-8 (or UTF-something) for everyone... I live in hope, but I'm not holding my breath.
fred.bear
@orthogonal, a unix shell might help for all utf-8 ;)
RC
@RC: Thanks. I'm SURE you are right, but (good or bad) Windows exists, and I'll keep trying to understand its odd ways... and as an offshoot, I've finally stepped into the world of Python .. and I'm liking what I see... Most of my utility tools are Unix based... Due to some strange twist of fate, Windows just happens to be here... Next step: Bash :)
fred.bear
+2  A: 

Batch files and encodings are a finicky issue. First of all, batch files have no direct way of specifying the encoding they're in, and cmd does not really support Unicode batch files. You can easily see that if you save a batch file with a Unicode BOM or as UTF-16: it will throw an error.

What you see when you put the ÿ directly on the command line is this: when running a command, Windows treats the command line as Unicode (it may have been converted from some legacy encoding beforehand, but in the end what Windows uses is Unicode). So Python will (hopefully) always grab the Unicode content of the arguments.

However, since cmd has its own opinions about the codepage (and you never told it to use UTF-8), the UTF-8 string you put in the batch file won't be interpreted as UTF-8 but instead in the default cmd codepage (850 or 437, in your case).
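You can reproduce the exact mojibake from your output in a Python session:

    >>> 'ÿ'.encode('utf-8')          # the UTF-8 bytes in the batch file
    b'\xc3\xbf'
    >>> b'\xc3\xbf'.decode('cp850')  # how cmd hands them to Python
    '├┐'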

You can force UTF-8 with chcp:

chcp 65001 > nul

You can save the following file as UTF-8 and try it out:

@echo off
chcp 850 >nul
echo ÿ
chcp 65001 >nul
echo ÿ

Keep in mind, though, that the chcp setting will persist in the shell if you run the batch from there, which may make things weird.

Joey
@Johannes: Thanks for the further insight... I have tried chcp 65001 (previously, and again now), but it has the alarming effect of stopping me from invoking any .cmd from the command line... e.g. with ComSpec=C:\WINDOWS\system32\cmd.exe, from the Run dialog: cmd.exe (with and without /U); chcp reports 850 by default; after chcp 65001, x.cmd fails (it contains only @exit). Applying chcp 65001 to the console allows built-in, .exe, and .py commands to run, but renders .bat and .cmd unusable!? (strange!) Applying chcp 65001 within the .cmd only has the same effect... Any ideas, anyone?
fred.bear
I'm confused now what exactly your problem is here. Care to edit it into your question with proper formatting?
Joey
I'm sure (now) that my original problem is, at least, a codepage issue, but using chcp 65001 has introduced another "new" problem (this may be the point of confusion)... My calling .cmd, UTF-8 encoded (no BOM), is as shown (only 3 lines)... The called .py is exactly as shown (7 lines)... When I (now) add chcp 65001 at the command line OR within the .cmd itself, the chcp 65001 somehow "disables" calls to .cmd (and to .py when the chcp 65001 was applied in the .cmd)... I can't think of any way to format it other than to add one "chcp 65001" line (but that was not part of the original problem)...
fred.bear
I've now posted the (new) full code.
fred.bear