Consider this program:

#include <stdio.h>
int main() {
    printf("%s\n", __FILE__);
    return 0;
}

Depending on the name of the file, this program works - or not. The issue I'm facing is that I'd like to print the name of the current file in an encoding-safe way. However, in case the file has funny characters which cannot be represented in the current code page, the compiler yields a warning (rightfully so):

?????????.c(3) : warning C4566: character represented by universal-character-name '\u043F' cannot be represented in the current code page (1252)

How do I tackle this? I'd like to store the string given by __FILE__ in e.g. UTF-16 so that I can properly print it on any other system at runtime (by converting the stored UTF-16 representation to whatever the runtime system uses). To do so, I need to know:

  1. What encoding is used for the string given by __FILE__? It seems that, at least on Windows, the current system code page (in my case, Windows-1252) is used - but this is just guessing. Is this true?
  2. How can I store the UTF-8 (or UTF-16) representation of that string in my source code at build time?

My real life use case: I have a macro which traces the current program execution, writing the current sourcecode/line number information to a file. It looks like this:

struct LogFile {
    // Write message to file. The file should contain the UTF-8 encoded data!
    void writeMessage( const std::string &msg );
};

// Global function which returns a pointer to the 'active' log file.
LogFile *activeLogFile();

#define TRACE_BEACON activeLogFile()->writeMessage( __FILE__ );

This breaks in case the current source file has a name which contains characters which cannot be represented by the current code page.

A: 

As for the encoding, I'm going to guess it's what's used by the filesystem, probably Unicode.

As for dealing with it, how about changing your code to something like:

#define TRACE_BEACON activeLogFile()->writeMessage( FixThisString(__FILE__ )); 

std::string FixThisString(wchar_t* bad_string) { .....}

(Implementation of FixThisString is left as an exercise for the student.)

James Curran
`__FILE__` is a `char` string not a `wchar_t` string. You'll need to use the preprocessor to prefix `L` to it if you want to do this. And then you can use the right `printf`-family function to print it.
R..
@R: The error he is getting is that the string he is printing contains a `'\u043F'`, which would be a 16-bit Unicode wchar_t.
James Curran
A: 

The best solution is to use source filenames in the portable filename character set [A-Za-z0-9._-]. Since Windows does not support UTF-8, there's no way for arbitrary non-ASCII characters to be represented in ordinary strings without dependence on your configured local language.

gcc probably does not care; it treats all filenames as 8-bit strings, so if the file is accessible to gcc, its name will be representable. (I know cygwin provides a UTF-8 environment by default, and modern *nix will normally be UTF-8.) For MSVC, you might be able to use the preprocessor to prepend L to the expansion of __FILE__ and use %ls to format it.

R..
Care to explain the -1?
R..
+3  A: 

Use the __WFILE__ macro:

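#include <stdio.h>
#include <wchar.h>

// Two-level macro so that __FILE__ is expanded before L is pasted onto it.
#define WIDEN2(x) L ## x
#define WIDEN(x) WIDEN2(x)
#define __WFILE__ WIDEN(__FILE__)

int main() {
    // wprintf needs a wide format string; %ls formats a wide-string argument.
    wprintf(L"%ls\n", __WFILE__);
    return 0;
}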
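
Since the log file in the question is supposed to contain UTF-8, the wide string still has to be converted before it is written. The following is a minimal sketch of such a conversion, not a definitive implementation. `toUtf8` is a hypothetical helper name, and the code assumes that `wchar_t` holds UTF-16 code units where it is 16 bits wide (as with MSVC) and whole code points where it is wider. Unpaired surrogates are replaced with U+FFFD.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encode a wide string as UTF-8.
// Assumption: 16-bit wchar_t carries UTF-16, wider wchar_t carries code points.
std::string toUtf8(const std::wstring &in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = static_cast<char32_t>(in[i]);

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate, if any.
            char32_t lo = (i + 1 < in.size()) ? static_cast<char32_t>(in[i + 1]) : 0;
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;  // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;      // unpaired low surrogate
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this in place, the macro from the question could pass `toUtf8(__WFILE__)` to `writeMessage` instead of `__FILE__`. On Windows, the `WideCharToMultiByte` API with `CP_UTF8` is an alternative to hand-written conversion.
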
Hans Passant
This looks very interesting! However, it triggers a follow-up question: what encoding does the wide-character string use? UTF-16? Or is it a plain, unencoded, UCS-2 string? Right now it seems to me that this merely 'delays' the issue. However, it's much better than my current code so +1 from me.
Frerich Raabe
Unfortunately, it doesn't seem to work as expected: it just prints '???????' when the file has a Russian name. This is the same thing I see when listing the file with 'dir'. Maybe `__FILE__` is really tied to the filesystem encoding, but it doesn't honour whatever field Windows Explorer uses to show the Russian characters?
Frerich Raabe
Works on my machine. Are you using a console mode program? Did you switch the console to a Cyrillic code page with a font that supports the glyphs? SetConsoleCP(1251) for example with, say, the Consolas font. The default console encoding is OEM, it doesn't have the glyphs.
Hans Passant
I'm using a console program (no `/SUBSYSTEM:WINDOWS` passed to the linker) but I'm actually printing the string via `OutputDebugStringW`. It really seems to be a font issue; printing the individual bytes of the string yields e.g. 0x043f 0x0440 0x043e which is certainly not the Unicode code for '?'. Accepting this answer, thanks a lot!
Frerich Raabe
A: 

In MSVC, you can turn on Unicode and get UTF-16 encoded strings. It's in the project properties somewhere. In addition, you should just use wcout/cout, not printf/wprintf. Windows needed Unicode before Unicode existed, so they had a custom multi-byte character encoding, which is the default. However, Windows does support UTF-16; it is, for example, what C# strings use.

#include <iostream>

int main() {
    std::wcout << __WFILE__;
}
DeadMG
A: 

__FILE__ will always expand to a character string literal, so in essence it is compatible with char const*. This means that a compiler implementation has little choice other than to use the raw byte representation of the source file name as it presents itself at compile time.

Whether or not this is something sensible in the current locale doesn't matter. You could have a source file name that contains basically garbage, as long as your run-time system and compiler accept it as a valid file name.

If you, as a user, have a locale whose encoding differs from the one used in your file system, you will see a lot of ???? or the like.

But if both your locales agree upon the encoding, a plain printf should suffice and your terminal (or whatever you use to look at the output) should be able to print the characters correctly.

So the short answer is: it will only work if your system is consistent w.r.t. encoding. Otherwise you're out of luck, since guessing encodings is quite a difficult task.
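
One way to find out which bytes a given compiler actually stored for __FILE__, rather than guessing, is to dump them in hex. The sketch below assumes nothing about the encoding; `hexBytes` is a hypothetical helper name introduced only for illustration.

```cpp
#include <string>

// Hypothetical helper: render every byte of a narrow string as two hex
// digits separated by spaces, so the stored encoding can be inspected.
std::string hexBytes(const char *s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (; *s != '\0'; ++s) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned char b = static_cast<unsigned char>(*s);
        if (!out.empty()) {
            out += ' ';
        }
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

Printing `hexBytes(__FILE__)` from the affected source file shows what was stored. A run of `3f` bytes means the compiler already substituted '?' at build time, whereas a pair such as `d0 bf` for the letter U+043F would suggest the name was stored as UTF-8.
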

Jens Gustedt