I'm testing the behaviour of Windows terminal (cmd.exe) in relation to charset encodings. I have some test files in several encodings (Win1252, CP437, UTF-8, etc) with the Spanish text: "qué tal"

I open a cmd.exe terminal on my Windows 10 machine, with the default code page CP437 (I checked this in the terminal window's Properties). As expected, the type command gives correct output only for the CP437 file:

 C:\temp > type testfile-cp437.txt
 qué tal         (OK)
 C:\temp > type testfile-utf8.txt
 qu├⌐ tal       (WRONG)
 

All good till now.

I also have installed Git for Windows with its linux-like binaries.

Now I run its cat.exe (in the same terminal, mind you - I don't even open the bash.exe executable) and the results are different. It seems everything works in UTF-8:

 C:\temp > C:\Git\usr\bin\cat.exe testfile-cp437.txt
 qu□ tal         (WRONG)
 C:\temp > C:\Git\usr\bin\cat.exe testfile-utf8.txt
 qué tal        (OK)

Why is this so? I expected the cat command to simply send the bytes to the terminal, so the results should be the same. Where is the decoding of the bytes as UTF-8 taking place here? Who is choosing the UTF-8 encoding, and why? Is this some implementation detail of this cat instance, or something else?

1 Answer

(in the same terminal, mind you - I don't even open the bash.exe executable)

That would still be the same terminal. Neither cmd.exe nor bash.exe is a terminal on its own – you're doing everything in the Windows console (Conhost), which Windows automatically spawns for 'console' executables.

The Windows console is not really like your usual terminal, and doesn't just use stdio as its only interface – it has a whole API around it. And like most things in Windows, it deals with UTF-16 as its primary text encoding.

For example, although programs can output text to their stdout using ordinary WriteFile(), there is also a dedicated WriteConsole() function, which (like most Windows APIs) comes in two versions: byte-oriented WriteConsoleA(), which expects data in the current ANSI/OEM encoding, and Unicode-oriented WriteConsoleW() which always takes UTF-16.
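To make the byte-oriented path concrete, here is a rough Python sketch of what happens when raw file bytes are interpreted in the active OEM codepage. It only simulates the decoding step with Python's own codecs and does not call any console API; the helper name `show_as_oem` is made up for illustration.

```python
# The same text, saved the way the two test files were saved.
text = "qué tal"
cp437_bytes = text.encode("cp437")   # b'qu\x82 tal'
utf8_bytes = text.encode("utf-8")    # b'qu\xc3\xa9 tal'


def show_as_oem(data: bytes, codepage: str = "cp437") -> str:
    """Interpret raw bytes as the given OEM codepage (hypothetical helper)."""
    return data.decode(codepage)


from_cp437_file = show_as_oem(cp437_bytes)  # 'qué tal'
from_utf8_file = show_as_oem(utf8_bytes)    # 'qu├⌐ tal': each UTF-8 byte becomes one CP437 character
```

This matches what `type` shows in the question: the bytes are passed through unchanged and the console interprets them as CP437.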

So if programs know that they're dealing with text in a known encoding, and if they're writing to a console, they don't need to rely on the "current OEM codepage" – the program could do its own conversion to UTF-16 and then use WriteConsoleW() to directly output the text in Unicode.
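As a minimal sketch of that idea (not how any particular tool implements it), the following Python function uses ctypes to call WriteConsoleW() when stdout is a real console, and falls back to the ordinary stream otherwise. The function name `write_unicode` is invented for this example; the check via GetConsoleMode() is one common way to detect a console handle, assumed here rather than taken from the tools discussed.

```python
import sys


def write_unicode(text: str) -> bool:
    """Write text with WriteConsoleW when stdout is a Windows console.

    Returns True if the console API was used, False if the text went
    through the ordinary stream (redirected output, or not Windows).
    """
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
        kernel32.GetStdHandle.restype = wintypes.HANDLE
        kernel32.GetConsoleMode.argtypes = [
            wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)]
        kernel32.GetConsoleMode.restype = wintypes.BOOL
        kernel32.WriteConsoleW.argtypes = [
            wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
            ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID]
        kernel32.WriteConsoleW.restype = wintypes.BOOL

        handle = kernel32.GetStdHandle(0xFFFFFFF5)  # STD_OUTPUT_HANDLE, (DWORD)-11
        mode = wintypes.DWORD()
        # GetConsoleMode succeeds only for console handles, not files or pipes.
        if kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
            sys.stdout.flush()  # keep ordering with earlier buffered output
            units = len(text.encode("utf-16-le")) // 2  # UTF-16 code units
            written = wintypes.DWORD()
            if kernel32.WriteConsoleW(handle, text, units,
                                      ctypes.byref(written), None):
                return True
    # Not a console (or the call failed): use the normal byte stream.
    sys.stdout.write(text)
    return False
```

Calling `write_unicode("qué tal\n")` in a console window should display the accented character regardless of the active codepage, because no OEM conversion is involved on that path.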

(Even Cmd's built-in type command does something like that: if it detects your file as having the UTF-16 BOM, it will output its contents as Unicode regardless of the active codepage.)
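The BOM check can be pictured roughly like this in Python. This is only an illustration of the idea, not Cmd's actual logic; the function name is hypothetical, and checking only the little-endian BOM is an assumption.

```python
import codecs


def decode_like_type(data: bytes, oem_codepage: str = "cp437") -> str:
    """Decode file bytes: UTF-16 if a BOM is present, else the OEM codepage."""
    bom = codecs.BOM_UTF16_LE  # b'\xff\xfe'
    if data.startswith(bom):
        # BOM found: treat the rest as UTF-16 and ignore the codepage.
        return data[len(bom):].decode("utf-16-le")
    # No BOM: the bytes are interpreted in the active codepage.
    return data.decode(oem_codepage)


utf16_file = codecs.BOM_UTF16_LE + "qué tal".encode("utf-16-le")
cp437_file = "qué tal".encode("cp437")
utf8_file = "qué tal".encode("utf-8")
```

With these inputs, the UTF-16 and CP437 files both come out as "qué tal", while the BOM-less UTF-8 file is still misread as CP437.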

Tools found in Git for Windows are compiled using the MinGW runtime, which like Cygwin tries to smooth out certain differences between POSIX and Windows environments. It seems that MinGW's stdio layer has special handling for Windows consoles – remember that Git deals with UTF-8 data a lot, so it wouldn't work well in a console set up for CP437 – and so whenever MinGW detects that it's writing text to a console, it will automatically convert from UTF-8[1] to UTF-16 and directly output it as Unicode using WriteConsoleW()[2].

This way Git.exe itself does not need to worry about OEM codepages – e.g. git log can simply output UTF-8 encoded author names or commit messages as-is (like it would on Linux) and let the MinGW runtime magically convert that to Windows-compatible Unicode, bypassing the OEM codepage conversion that would otherwise garble everything.
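The conversion described above can be approximated in Python as follows. This is a simulation under assumptions, not the runtime's real code: the function name `runtime_to_console` and the `charset` parameter (standing in for the locale's character set) are invented here, and the exact handling of invalid bytes in the real runtime may differ from Python's `errors="replace"`.

```python
def runtime_to_console(data: bytes, charset: str = "utf-8") -> bytes:
    """Decode program output in the locale charset and re-encode as UTF-16.

    The returned bytes stand in for what would be handed to WriteConsoleW().
    """
    # Invalid sequences become U+FFFD, similar to the box in the question.
    text = data.decode(charset, errors="replace")
    return text.encode("utf-16-le")


utf8_out = runtime_to_console(b"qu\xc3\xa9 tal").decode("utf-16-le")
# 'qué tal': valid UTF-8, so the console receives the right characters

cp437_out = runtime_to_console(b"qu\x82 tal").decode("utf-16-le")
# 'qu\ufffd tal': 0x82 alone is not valid UTF-8

cp437_with_locale = runtime_to_console(b"qu\x82 tal", "cp437").decode("utf-16-le")
# 'qué tal': the same bytes read correctly once the charset is CP437
```

This reproduces both results from the question: the UTF-8 file displays correctly through cat.exe, and the CP437 file gets a replacement character where the lone 0x82 byte was.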


[1] (MinGW actually performs this conversion according to POSIX locale settings, so if you set the LANG or LC_CTYPE environment variable to something like C.cp437, you will see the MSYS tools handling all text as if it was in CP437 instead.)

[2] (Some programs might also be using SetConsoleOutputCP() to temporarily switch the console to actual UTF-8 as the 'OEM' codepage – but it's more likely that MinGW uses WriteConsoleW() as that doesn't have any lasting effects after the program crashes, whereas the output CP would need to be explicitly restored on exit.)

  • Excellent answer, thanks!
    – leonbloy
    Commented Jun 18, 2022 at 13:16
