When I open cmd.exe in Windows, what encoding is it using? How can I check which encoding it is currently using? Does it depend on my regional setting or are there any environment variables to check?

What happens when you type a file with a certain encoding? Sometimes I get garbled characters (incorrect encoding used) and sometimes it kind-of works. However I don't trust anything as long as I don't know what's going on. Can anyone explain?

+2  A: 

To answer your second query re. how encoding works, Joel Spolsky wrote a great introductory article on this. Strongly recommended.

Brian Agnew
I've read it and I know it. However, on Windows I always feel lost because the OS and most applications seem totally ignorant of encoding.
danglund
+1  A: 

The CHCP command shows the current code page. It typically has three digits (8xx) and is different from the Windows 12xx code pages. So when typing English-only text you won't see any difference, but text in an extended code page (like Cyrillic) will be printed incorrectly.
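
If you'd rather check this from a program than by running chcp, here is a minimal sketch in Python. It assumes the Win32 functions GetConsoleCP and GetConsoleOutputCP from kernel32, called through ctypes; the helper name console_code_pages is made up for illustration. Off Windows it simply returns None.

```python
import ctypes
import sys


def console_code_pages():
    """Return (input_cp, output_cp) of the attached console, or None off Windows.

    Hypothetical helper for illustration. On Windows both calls return 0
    when the process has no console attached.
    """
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    return kernel32.GetConsoleCP(), kernel32.GetConsoleOutputCP()


print(console_code_pages())
```

In a plain cmd window the second number should normally match what chcp reports.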

Dewfy
+1  A: 

type

chcp

to see your current code page. (as Dewfy already said).

nlsinfo

to see all installed code pages and find out what your code page number means.

Edit: You need to have the Windows Server 2003 Resource Kit installed (it works on Windows XP) to use nlsinfo.

Cagdas Altinkaya
Interestingly, `nlsinfo` doesn't appear to exist on my Windows 7.
Joey
`nlsinfo` also doesn't exist on my Windows XP SP3 machine.
Thomas Owens
Oh, I'm sorry. I think it comes with Windows Server Resource Kit tools. I've used it a couple of times on my Windows XP SP3 machine earlier and didn't know it wasn't installed by default.
Cagdas Altinkaya
Ah, that explains why it's there on my Vista machine, where I installed those.
Joey
+11  A: 

While chcp does indeed show the current code page cmd uses, how much that matters depends on your settings and on how you started cmd; in some cases it is of little to no relevance.

First of all: Your console font determines what the console window is capable of displaying. More on that below.

Secondly, for Unicode files the current codepage only determines what gets displayed, depending on the font used (again, see below). For non-Unicode files the interpretation of the bytes is left to the current codepage, indeed:

> chcp 850
Active code page: 850

> type 1251.txt
abcde xyz
ÓßÔÒõ ²■ 

> chcp 1251
Active code page: 1251

> type 1251.txt
abcde xyz
абвгд эюя

(Will show up garbled if you have raster fonts enabled, but will copy fine.)
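
The same effect can be reproduced outside the console. The following is a small Python sketch, assuming Python's built-in cp850 and cp1251 codecs match what the console does for these particular bytes; the variable names are only illustrative. It takes the bytes a Windows-1251 file would contain and decodes them once with each code page.

```python
# Bytes as they would sit in a file saved in Windows-1251 (Cyrillic ANSI).
raw = "абвгд эюя".encode("cp1251")

# Same bytes, two interpretations: the "active code page" decides the result.
as_850 = raw.decode("cp850")    # wrong code page, garbled Latin symbols
as_1251 = raw.decode("cp1251")  # matching code page, readable Cyrillic

print(as_850)
print(as_1251)
```

The bytes never change; only the table used to turn them into characters does.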


For the following I prepared a little test file, containing letters from different cultures:

ASCII     abcde xyz
German    äöü ÄÖÜ ß
Polish    ąęźżńł
Russian   абвгдеж эюя
CJK       你好
  • If you use "Raster Fonts" then the console window will be confined to the codepage chcp shows. Unfortunately this is still the default in Windows 7 and I wish they wouldn't stick to such stupid defaults. However, the characters that are displayable still depend on the system you have; in my case the raster fonts are only for Latin and won't display Russian or anything else.

    > chcp 850
    Active code page: 850
    
    
    > type uc-test.txt
    ASCII     abcde xyz
    German    äöü ÄÖÜ ß
    Polish    aezznl
    Russian   ??????? ???
    CJK       ??
    
    
    > chcp 437
    Active code page: 437
    
    
    > type uc-test.txt
    ASCII     abcde xyz
    German    äöü ÄÖÜ ß
    Polish    aezznl
    Russian   ??????? ???
    CJK       ??
    

    Note that in both CP850 and CP437 the German umlauts and ß work fine. The Polish letters ąęźżńł get converted as well as possible to their closest fits in ASCII, whereas for Russian or CJK ideographs there is no such easy replacement, which is why they become question marks.

    > chcp 1251
    Active code page: 1251
    
    
    > type uc-test.txt
    ASCII     abcde xyz
    German    aou AOU ?
    Polish    aezznl
    Russian   абвгдеж эюя
    CJK       ??
    

    1251 is the ANSI codepage for Cyrillic. As you can see, it lacks both umlauts and Polish letters, but they can get converted to their closest equivalent in that codepage, unlike ß, which just becomes a question mark again. But Russian now works correctly.

    > chcp 1250
    Active code page: 1250
    
    
    > type uc-test.txt
    ASCII     abcde xyz
    German    äöü ÄÖÜ ß
    Polish    ąęźżńł
    Russian   ??????? ???
    CJK       ??
    

    1250 is the ANSI codepage for Central European, which includes Polish. The German special letters are included as well, which is nice when talking to German-speaking Poles. However, Russian and Chinese are not there and thus just get question marks again.

    Interesting to note is that when using raster fonts, the console window's copy/paste abilities are tied to the selected codepage (the text is probably copied as Unicode anyway), so even when one isn't able to see Russian due to font issues, it copies fine, as long as one is in CP1251, or 866, or 855 (well, there are many of them :-)).

  • If you select a Unicode font, such as Lucida Console or Consolas, then you will be able to see and type Unicode characters on the console, regardless of what chcp says:

    > chcp 850
    Active code page: 850
    
    
    > type uc-test.txt
    ASCII     abcde xyz
    German    äöü ÄÖÜ ß
    Polish    ąęźżńł
    Russian   абвгдеж эюя
    CJK       你好
    
    
    > chcp 437
    Active code page: 437
    
    
    > type uc-test.txt
    ASCII     abcde xyz
    German    äöü ÄÖÜ ß
    Polish    ąęźżńł
    Russian   абвгдеж эюя
    CJK       你好
    
    
    > chcp 1251
    Active code page: 1251
    
    
    > type uc-test.txt
    ASCII     abcde xyz
    German    äöü ÄÖÜ ß
    Polish    ąęźżńł
    Russian   абвгдеж эюя
    CJK       你好
    

    (Note that the CJK characters probably only show up as boxes in your console, as they do here, but the characters are still correct; it's the font that lacks the glyphs.)
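
To get a feeling for which of the test lines survive a given code page in the raster-font case, here is a rough Python sketch. It pushes each sample line through a code page and back, using Python's own codecs; the helper name through_code_page is made up for illustration. One important assumption and difference: Python replaces every unrepresentable character with a question mark, whereas the console applies Windows' best-fit mapping first (which is why ą showed up as a above). So the Polish and German lines will look worse here than in the transcripts.

```python
SAMPLES = {
    "ASCII": "abcde xyz",
    "German": "äöü ÄÖÜ ß",
    "Polish": "ąęźżńł",
    "Russian": "абвгдеж эюя",
    "CJK": "你好",
}


def through_code_page(text, code_page):
    """Encode text into a code page and decode it back.

    Characters the code page cannot represent come back as '?'.
    Unlike the Windows console, no best-fit substitution is attempted.
    """
    return text.encode(code_page, errors="replace").decode(code_page)


for cp in ("cp850", "cp437", "cp1251", "cp1250"):
    print(cp)
    for label, text in SAMPLES.items():
        print(f"  {label:<8}  {through_code_page(text, cp)}")
```

Lines that come back unchanged are fully covered by that code page; question marks mark the characters it has no slot for.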

Then there is the encoding that is used when cmd is redirecting stuff to a file. This closely follows chcp, regardless of the font used for the console window. You can start cmd with

cmd /u

to cause it to redirect to files in Unicode (UTF-16, Little Endian in this case, as usual on Windows).
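
What that means for whoever reads the redirected file later can be sketched in Python. This is only an illustration under assumptions: the helper decode_redirected is hypothetical, and cp850 stands in for whatever chcp reported when the file was written.

```python
def decode_redirected(data, from_cmd_u, code_page="cp850"):
    """Decode bytes captured from a redirected cmd command.

    Hypothetical helper: from_cmd_u says whether cmd was started with /u,
    code_page is the assumed active code page otherwise.
    """
    return data.decode("utf-16-le" if from_cmd_u else code_page)


text = "äöü"
unicode_bytes = text.encode("utf-16-le")  # two bytes per character here
oem_bytes = text.encode("cp850")          # one byte per character

print(len(unicode_bytes), len(oem_bytes))
print(decode_redirected(unicode_bytes, from_cmd_u=True))
print(decode_redirected(oem_bytes, from_cmd_u=False))
```

A quick way to tell the two apart by eye: in UTF-16 LE every plain ASCII character is followed by a zero byte.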

Joey
Thanks a lot for this detailed description. As always there is no short and easy answer when it comes to encoding, but this explains it beautifully. Thanks!
danglund
A: 

We have a problem with a DOS-application under Windows 7 (32bit). The DOS application runs just fine under code page 852. But when we call a DOS Shell from running DOS-app to execute an external program/application, the CP 852 changes to CP 437 as soon as DOS execution (DOS Shell) is finished. We have tried many tricks to resolve the bug but we're completely stuck now... :-( We really need help, please. Thank you very much for your time.

Vladimir Cvajniga
You might have better luck asking this on http://superuser.com/ . In any case, you should ask it as a standalone question, not as an answer to someone else's question.
Alan Moore
This is not an answer
Vertis