Using the latest version of the Microsoft compiler (included with the Win7 SDK), I'm attempting to compile a source file that's encoded as UTF-8 with Unicode line separators.

Unfortunately, the code will not compile -- even if I include the UTF-8 signature at the start of the file. For example, if I try to compile this:

#include <stdio.h>

int main (void)
{
    printf("Hello!");
    return 0;
}

I'll see the following error:


Prompt> cl test.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.

test.c
test.c(1) : warning C4067: unexpected tokens following preprocessor directive - expected a newline
Microsoft (R) Incremental Linker Version 9.00.30729.01
Copyright (C) Microsoft Corporation. All rights reserved.

/out:test.exe
test.obj
LINK : fatal error LNK1561: entry point must be defined


Has anyone encountered this problem before? Any solutions?

Thanks! Andrew

+1  A: 

When you say "Unicode line separators", do you mean UTF-16/UCS-2 (i.e., 16-bit characters)? If that's the case (the file is a mix of different encodings), I'd say the only reasonable fix is to fix the files.

If you mean the line endings are some other Unicode code point (still encoded in UTF-8), then you'll still need to fix the files. The standard says this about the first phase of translation:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing newline characters for end-of-line indicators) if necessary.

Apparently MS does not perform this translation for the "Unicode line separators", so you'll need to do it yourself.
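One way to do that translation yourself before handing the file to the compiler is sketched below. It is only an illustration under stated assumptions: `normalize_separators` is a hypothetical helper name (not an MSVC or SDK facility), the file is assumed to be already loaded into memory as UTF-8 bytes, and treating U+2029 PARAGRAPH SEPARATOR the same way as U+2028 is an added assumption that goes beyond the case in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rewrites buf in place, replacing each UTF-8 encoded
 * U+2028 LINE SEPARATOR (E2 80 A8) or U+2029 PARAGRAPH SEPARATOR (E2 80 A9)
 * with a single '\n'. Returns the new length of the data in buf.
 * Handling U+2029 is an assumption; the question only mentions U+2028. */
size_t normalize_separators(unsigned char *buf, size_t len)
{
    size_t in = 0;   /* next byte to read */
    size_t out = 0;  /* next byte to write; never ahead of in */

    assert(buf != NULL || len == 0);

    while (in < len) {
        if (in + 2 < len &&
            buf[in] == 0xE2 && buf[in + 1] == 0x80 &&
            (buf[in + 2] == 0xA8 || buf[in + 2] == 0xA9)) {
            buf[out++] = '\n';  /* three-byte separator becomes one LF */
            in += 3;
        } else {
            buf[out++] = buf[in++];  /* copy every other byte unchanged */
        }
    }
    return out;
}
```

A caller would read the source file in binary mode, pass the buffer through this function, and write the first `n` returned bytes back out. The output uses bare LF; whether CRLF would be preferable for a given toolchain is left open here.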

Michael Burr
Using Visual Studio's "Advanced Save Options" dialog, I'm specifying a UTF-8 encoding with Unicode line separators. The line separator is being encoded as UTF-8, as it should be. I've used a hex editor to verify that the newline value is '0xE2 0x80 0xA8', which is indeed UTF-8.
But VS isn't looking for 0xE2 0x80 0xA8. It wants 0x0d 0x0a. It doesn't matter if you're morally in the right; it wants that 0x0a, which is still perfectly valid UTF-8 anyway.
Logan Capaldo
Interesting. That's an option I've never used. Unfortunately, it looks like MSVC does not support source files in that format, even though the editor does (I suppose you might want your programs to be able to deal with such data files). Just curious - do you know if another compiler (GCC?) does?
Michael Burr
If you feel strongly that this type of source encoding should be supported, you can post a bug report/change request on http://connect.microsoft.com/VisualStudio.
Michael Burr
Yeah, I think you're right, Logan. I was just hoping there might be some obscure compiler option to allow it to support this encoding, particularly since the editor supports it just fine. Thanks for all the help, everyone.
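For reference, the byte sequence 0xE2 0x80 0xA8 quoted in the comments above is the UTF-8 encoding of U+2028 LINE SEPARATOR. The sketch below shows how those bytes can be derived from the code point. `utf8_encode` is a hypothetical name chosen for this illustration, not a standard library function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes code point cp as UTF-8 into out and
 * returns the number of bytes written (1 to 4), or 0 if cp is not a
 * valid Unicode scalar value (a surrogate, or above U+10FFFF). */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;  /* surrogates cannot be encoded in UTF-8 */

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* U+2028 takes this branch and yields E2 80 A8 */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

By contrast, LF (U+000A) encodes to the single byte 0x0A, which is why a CR LF or LF terminated file is still valid UTF-8.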
A: 

Seems pretty obvious to me: there needs to be a newline after the #include directive.

Newlines are still Unicode, so it shouldn't be that big of a deal to add one.

Logan Capaldo
Well, Windows encodes a new line as CRLF, whereas Unix encodes it as LF. The Unicode definition attempts to fix these conflicting new line implementations by defining a 'Unicode new line.' You can read about it here: http://en.wikipedia.org/wiki/Unicode#New_lines
+1  A: 

Are you referring to this character, as opposed to the traditional CR LF characters?

I'd guess that the compiler is expecting some combination of CR and LF only.

nbeyer
Yea, that's the character I'm referring to. I was really hoping this would be supported by the MS Compiler. It's odd, since Visual Studio provides the option to save the file using this encoding but can't compile it.
Don't forget the editor is independent from the compiler, and it's intended to be useful for files other than source files. Still, I can understand why one might want or expect the compiler to support those line endings. It's not an unreasonable expectation - but I'm not surprised that it doesn't.
Michael Burr
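To check which line terminators a source file actually contains before compiling it, a small scanner along these lines could help. This is a rough sketch: `eol_counts` and `count_eols` are hypothetical names, and the input is assumed to be the raw bytes of a UTF-8 file already read into memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tally of the line terminators found in a buffer. */
struct eol_counts {
    size_t crlf;  /* CR LF pairs (0D 0A) */
    size_t lf;    /* lone LF (0A) */
    size_t cr;    /* lone CR (0D) */
    size_t ls;    /* UTF-8 encoded U+2028 (E2 80 A8) */
};

/* Hypothetical helper: scans len bytes of buf and counts each kind of
 * line terminator. A CR immediately followed by LF counts once, as CRLF. */
struct eol_counts count_eols(const unsigned char *buf, size_t len)
{
    struct eol_counts c = {0, 0, 0, 0};
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        if (buf[i] == '\r') {
            if (i + 1 < len && buf[i + 1] == '\n') {
                c.crlf++;
                i += 2;
            } else {
                c.cr++;
                i++;
            }
        } else if (buf[i] == '\n') {
            c.lf++;
            i++;
        } else if (i + 2 < len && buf[i] == 0xE2 &&
                   buf[i + 1] == 0x80 && buf[i + 2] == 0xA8) {
            c.ls++;
            i += 3;
        } else {
            i++;
        }
    }
    return c;
}
```

A nonzero `ls` count with zero `crlf` and `lf` counts would match the situation described in the question, where the compiler sees the whole file as a single line.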
+1  A: 

Submitted a bug report to Microsoft with ID 414985. Meh. We'll see what becomes of it.