Further to this question I've got a supplementary problem.

I've found a track with an "É" in the title.

My code:

var playList = new StreamWriter(playlist, false, Encoding.UTF8);

-

private static void WriteUTF8(StreamWriter playList, string output)
{
    byte[] byteArray = Encoding.UTF8.GetBytes(output);
    foreach (byte b in byteArray)
    {
        playList.Write(Convert.ToChar(b));
    }
}

converts this to the following bytes:

195
137

which is being output as Ã followed by a square (which is a character that can't be printed in the current font).

I've exported the same file to a playlist in Media Monkey and it writes the "É" as "Ã‰" - which I'm assuming is correct (as KennyTM pointed out).

My question is, how do I get the "‰" symbol output? Do I need to select a different font and if so which one?

UPDATE

People seem to be missing the point.

I can get the "É" written to the file using

playList.WriteLine("É");

that's not the problem.

The problem is that Media Monkey requires the file to be in the following format:

#EXTINFUTF8:140,Yann Tiersen - Comptine D'Un Autre Ã‰tÃ©: L'AprÃ¨s Midi
#EXTINF:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi
#UTF8:04-Comptine D'Un Autre Ã‰tÃ©- L'AprÃ¨s Midi.mp3
04-Comptine D'Un Autre Été- L'Après Midi.mp3

Where, on the `#EXTINFUTF8` and `#UTF8` lines, all the "high-ascii" (for want of a better term) characters are written out as a pair of characters.

UPDATE 2

I should be getting c9 replaced by c3 89.

I was going to put what I'm actually getting, but in doing the tests for this I've managed to get a test program to output the text in the right format "as is". So I need to do some more investigation.

+2  A: 

I don't do C# but the symptoms tell me that you're indeed writing it as UTF-8, but that the output/console/application/whatever with which you're viewing the written output is not using UTF-8, but ISO-8859-1 to display them and that MediaMonkey is using CP1252 to display them.

If you're viewing them in the IDE console, then you need to configure the IDE to use UTF-8 as console and text file encoding.

Update: you apparently want to write UTF-8 data as CP-1252. Now the question/problem is clearer. Again, I don't do C#, but the Java equivalent would be:

Writer writer = new OutputStreamWriter(new FileOutputStream("file.ext"), "windows-1252"); // Java's name for CP-1252
writer.write(someUTF8String); // Will be written as CP-1252. "É" would become "Ã‰"

Hopefully this gives some insights.

BalusC
This was my first thought, but the "m3u" file generated by Media Monkey displays the "‰" correctly when edited in Notepad++, the same application as I'm using to view the file I generate.
ChrisF
Do you instruct Notepad++ to save/encode the file as UTF-8? Check the `Format` menu in top bar.
BalusC
Ah - checking the "Encoding" menu reveals that the correct file is encoded in ANSI *not* UTF8. Interesting.
ChrisF
This means that the file is not saved with a UTF-8 BOM and/or Notepad++ doesn't use UTF-8 as *default* encoding and/or Notepad++ fails to auto-guess the actual encoding.
BalusC
I need to write some lines "as is" and some lines converted.
ChrisF
The Java equivalent would be `String cp1252string = new String(utf8string.getBytes("UTF-8"), "windows-1252");`. You need to convert chars to bytes using UTF-8 first and then convert those bytes to chars using CP-1252. Although a bit Java targeted, you may find [this article](http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html) useful as well to get some ideas/thoughts.
BalusC
You shouldn't use a UTF-8 faux-BOM. It'll break many tools including some players. It has to be loaded by a text editor as “ANSI” (the misleading name for default system code page) because it contains a mixture of UTF-8-encoded and “ANSI” content; it is unlikely that the “ANSI” text happens to form valid UTF-8 byte sequences.
bobince
+3  A: 

Using Convert.ToChar like that is almost certainly a bad idea. You're basically encoding things twice.

You should either be performing the conversion yourself and then writing directly to a stream, or you should be letting the StreamWriter do the conversion. Why are you using a StreamWriter at all if you're trying to perform the conversions yourself?

Are you trying to write to a binary file, or a simple text file? If it's a simple text file, just use a StreamWriter and let that do the conversion. If it's a binary file, use a Stream instead of a StreamWriter, and perform text encoding directly where you need to, writing the bytes straight to the stream afterwards.

EDIT: Here's what's happening with your original code:

Encoding.UTF8.GetBytes(text) => byte[] { 0xc3, 0x89 };

Convert.ToChar(0xc3) => char U+00C3
StreamWriter writes U+00C3 as byte[] { 0xc3, 0x83 };

Convert.ToChar(0x89) => char U+0089
StreamWriter writes U+0089 as byte[] { 0xc2, 0x89 };

So that's why you're getting c3 83 c2 89 written to the file.
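
The byte sequence above can be reproduced with a small test program. This is only a minimal sketch, assuming a `MemoryStream` in place of the real playlist file; the names `DoubleEncodingDemo`, `WriteLikeQuestion` and `WriteDirectly` are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class DoubleEncodingDemo
{
    // Mimics the question's WriteUTF8: each UTF-8 byte is turned back into a
    // char, which the StreamWriter then encodes as UTF-8 a second time.
    public static byte[] WriteLikeQuestion(string text)
    {
        var stream = new MemoryStream();
        using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
        {
            foreach (byte b in Encoding.UTF8.GetBytes(text))
            {
                writer.Write(Convert.ToChar(b));
            }
        }
        return stream.ToArray(); // ToArray still works after the stream is closed
    }

    // Lets the StreamWriter perform the one and only encoding step.
    public static byte[] WriteDirectly(string text)
    {
        var stream = new MemoryStream();
        using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
        {
            writer.Write(text);
        }
        return stream.ToArray();
    }

    static void Main()
    {
        // "\u00C9" is "É", escaped so the source file's own encoding doesn't matter.
        Console.WriteLine(BitConverter.ToString(WriteLikeQuestion("\u00C9"))); // C3-83-C2-89
        Console.WriteLine(BitConverter.ToString(WriteDirectly("\u00C9")));     // C3-89
    }
}
```

The first line printed is the doubly encoded form, the second is plain UTF-8.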

Jon Skeet
I'm writing a simple text file - in fact an "m3u" file which needs the unicode characters written as "Ã" followed by whatever character the second byte translates to.
ChrisF
@ChrisF: It's not really clear what you mean by that. Is it a plain UTF-8 file or not? It's simplest to report what *bytes* are required rather than characters.
Jon Skeet
I *think* it's a plain UTF-8 file. Every other conversion I've come across is done correctly: "é" -> "Ã©", "ó" -> "Ã³" etc. It's just this one that's failing.
ChrisF
@ChrisF: If it's just plain UTF-8, you shouldn't need to do your own encoding at all. Just write the text to the `StreamWriter`. But again, don't think of the encoding as translating one character to two characters. It encodes characters as *bytes*.
Jon Skeet
This doesn't explain that the *same* file is displayed *differently* in different applications.
BalusC
@Jon - see my update to the question. Just writing the text to the `StreamWriter` results in "É" being written to the file, but I need "Ã‰" written to the file instead.
ChrisF
@ChrisF: Once again, you're saying what you think you need in terms of *characters*. That doesn't make any sense, because the result is *bytes*.
Jon Skeet
@Jon - I see what you're saying. If I just write a line containing "É" into a `UTF8` encoded file I get "É", but if I write any plain text first I get "É". How can I mix both modes.
ChrisF
@ChrisF: It's not at all clear what you mean by "I get "É"" - that's a sequence of characters... where are you reading those characters, and what encoding is it using to decode the file?
Jon Skeet
@Jon - OK, I'm not using the correct terminology, but I need "É" to be converted into "Ã‰". These characters are being read by Notepad++ (so I can view the file) and Media Monkey (to play the music). See the update to the question for the actual format I need.
ChrisF
@ChrisF: If you refuse to tell us the *bytes* that you need in the file, then I'm afraid we can't really help you. That's what a file is made of: bytes. Open up the file in a hex editor so you can see the *exact data* without Notepad++ or Media Monkey "interpreting" the file according to an encoding.
Jon Skeet
@Jon - I'm not refusing to tell you. I've been distracted by other things ;)
ChrisF
@Jon - I should be getting `c3 89` out as a replacement for `c9`. However, looking at what I'm writing it shows `c3 83 c2 89` - which seems to have two extra bytes in the middle.
ChrisF
@ChrisF: Okay, now we're getting somewhere - although it's only bytes as *output* - the *input* is still the character "É". Now, if you just write "É" out using a `StreamWriter` you should indeed get c3 89, and that's what I'm seeing. Your c3 83 c2 89 is no doubt the result of the "double encoding" you're doing by calling `Encoding.GetBytes` and then `Convert.ToChar(byte)`. Just write the text to the `StreamWriter` and you'll get c3 89.
Jon Skeet
@Jon - I agree that's what I should get, and indeed it's what I've just started getting in my test program after experimenting with the different encodings and then closing the stream and re-opening it with the appropriate encoding for the characters I want to output. I just need to translate this back to the actual program and check it still works.
ChrisF
+2  A: 

StreamWriter already converts the characters you send it to UTF-8 — that's its entire purpose. Throw WriteUTF8 away; it's broken and useless.

(WriteUTF8 is taking characters, converting them to UTF-8 bytes, converting each single byte to the character it maps to in the current code page, then encoding each of those characters in UTF-8. So in the best case you have a doubly-UTF-8-encoded string; in the worst, you've completely lost bytes that weren't mapped in the system code page repertoire; especially bad for DBCS code pages.)

The problem you're having with Media Monkey may be just that it doesn't support UTF-8 or Unicode filenames at all. Try asking it to play (and export a playlist for) files with characters that don't fit in your system codepage, for example by renaming a file to αβγ.mp3.

Edit:

#EXTINFUTF8:140,Yann Tiersen - Comptine D'Un Autre Ã‰tÃ©: L'AprÃ¨s Midi
#EXTINF:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi
#UTF8:04-Comptine D'Un Autre Ã‰tÃ©- L'AprÃ¨s Midi.mp3
04-Comptine D'Un Autre Été- L'Après Midi.mp3

OK, what you've got there is a mixture of encodings in the same file: it's no wonder text editors are going to have trouble opening it. The uncommented and #EXTINF lines are in the system default code page, and are present to support media players that can't read Unicode filenames. Any filename characters not present in the system code page (eg. Greek as above, on a Western Windows install) will be mangled and unplayable for anything that doesn't know about the #UTF8 (and #EXTINFUTF8 for the description) lines.

So if this is your target format, you'll need to grab two encodings and use each in turn, something like:

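private static void writePlaylistEntry(Stream playlist, string filename, int length) {
    Encoding utf8 = new UTF8Encoding(false); // UTF-8 without a BOM
    Encoding ansi = Encoding.Default;        // system code page on .NET Framework
    writeLine(playlist, utf8, "#EXTINFUTF8:" + length + "," + filename);
    writeLine(playlist, ansi, "#EXTINF:" + length + "," + filename);
    writeLine(playlist, utf8, "#UTF8:" + filename);
    writeLine(playlist, ansi, filename);
}

// Stream.Write takes (buffer, offset, count), so encode the line and write the whole array.
private static void writeLine(Stream playlist, Encoding encoding, string text) {
    byte[] bytes = encoding.GetBytes(text + "\n");
    playlist.Write(bytes, 0, bytes.Length);
}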
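
As a rough way to check the result, the same idea can be run against a `MemoryStream` and the bytes dumped as hex. This is only a sketch: `MixedEncodingDemo` and `BuildEntry` are hypothetical names, and ISO-8859-1 stands in for the system code page so the output doesn't depend on the machine it runs on.

```csharp
using System;
using System.IO;
using System.Text;

static class MixedEncodingDemo
{
    // Encodes one line with the given encoding and appends it to the stream.
    static void WriteLine(Stream output, Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text + "\n");
        output.Write(bytes, 0, bytes.Length);
    }

    // Builds the "#UTF8:" line followed by the plain filename line.
    public static byte[] BuildEntry(string filename)
    {
        Encoding utf8 = new UTF8Encoding(false);
        // Assumption: ISO-8859-1 stands in for the "ANSI" code page here.
        Encoding ansi = Encoding.GetEncoding("iso-8859-1");
        var stream = new MemoryStream();
        WriteLine(stream, utf8, "#UTF8:" + filename);
        WriteLine(stream, ansi, filename);
        return stream.ToArray();
    }

    static void Main()
    {
        // "\u00C9.mp3" is "É.mp3", escaped so the source file's encoding doesn't matter.
        Console.WriteLine(BitConverter.ToString(BuildEntry("\u00C9.mp3")));
    }
}
```

In the hex dump the "É" shows up as `C3-89` on the `#UTF8:` line and as a single `C9` on the plain line, which is the mixture described above.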
bobince
@bobince - it does support UTF-8 filenames indirectly. It writes out the filename preceded by `#UTF8:` and any UTF8 characters split into two.
ChrisF
@ChrisF: edited re update.
bobince
@bobince - what Encoding do I use to open the stream?
ChrisF
You don't use an Encoding. It's a byte Stream rather than a character StreamWriter. You can't use a single StreamWriter because as it turns out the file isn't using just one encoding, it changes for every line.
bobince
D'oh - of course.
ChrisF
+2  A: 

The more fundamental problem is in the name of the method:

 private static void WriteUTF8(...)

.M3U files aren't UTF-8. They're Latin-1 (or Windows-1252).

Instead of Encoding.UTF8, you should be using Encoding.GetEncoding(1252). Then you can just write directly to the stream, you won't need any of this conversion weirdness.

Update:

I just tried the following C# code and the resulting .M3U opens just fine in both Winamp and WMP:

static void Main(string[] args)
{
    string fileName = @"C:\Temp\Test.m3u";
    using (StreamWriter writer = new StreamWriter(fileName, false,
        Encoding.GetEncoding(1252)))
    {
        writer.WriteLine("#EXTM3U");
        writer.WriteLine("#EXTINF:140,Yann Tiersen " +
            "- Comptine D'Un Autre Été: L'Après Midi");
        writer.WriteLine("04-Comptine D'Un Autre Été- L'Après Midi.mp3");
    }
}

So, as I said - just use the right encoding to begin with. You don't need all those extra #EXTINFUTF8 and #UTF8 lines, unless it's some bizarre requirement for Media Monkey (it's definitely not part of the basic M3U spec).

Aaronaught
This didn't help - I get "�" rather than "É" for the split character.
ChrisF
@ChrisF: Did you replace the whole block with something that writes the encoded bytes directly to the stream, or did you only change `UTF8` to `1252` and use the same `Convert` loop?
Aaronaught
I tried both. I'm now getting something completely wrong though so I'm at a complete loss.
ChrisF
@ChrisF: I don't suppose I could trouble you to post the code that creates the `Stream` and `StreamWriter`? I'd like to test it myself, I'm still fairly certain that it's an encoding issue (even though it may appear to be something else). Also does the issue come up in Winamp (or any other media player) or just Media Monkey?
Aaronaught
@Aaronaught - The code that creates the `StreamWriter` is already in the question. I've only recently rebuilt my PC so I've only got Media Monkey and Windows Media Player installed. I'll try it in WMP.
ChrisF
Just tried in WMP. Got an error: "Windows Media Player encountered a problem while downloading the playlist. For additional assistance, click Web Help." Looks like the file is in the wrong format.
ChrisF
@ChrisF: Updated. Let me know if I'm still not getting it...
Aaronaught
Post-script: Just to satisfy my own doubts, I looked up the M3U spec and this is in fact correct; M3U really does not support UTF-8. These weird mixed encodings must be some proprietary feature of Media Monkey. If you need to put UTF-8 into an M3U file (i.e. because the ANSI code page doesn't support some of the characters) then you should be using M3U8, which is the same as M3U but encoded as UTF-8. Most media players should be capable of reading these.
Aaronaught
@Aaronaught - I don't think Media Monkey supports m3u8 (or at least my searches didn't find any information on it) - hence the need for the weird `UTF8` lines. Your revised code is what I had originally which won't open in Media Monkey - sorry.
ChrisF
@ChrisF: Well, looks like bobince has your answer then... my only concern would be that other media players could choke on these files; mixing encodings in a single text file is just not kosher. This really seems *a lot* like a bug in Media Monkey, since it opens in other players.
Aaronaught
@Aaronaught - I used the playlist that came out of Media Monkey in WMP without any problems. I'm guessing that it just ignored the lines with `UTF8` in.
ChrisF
A: 

Right, first off thanks to everyone for their help and patience.

I've finally got it working correctly. I've implemented a version of bobince's solution which is why he gets the acceptance (up-votes to everyone else). Here's my code:

var playList = new StreamWriter(playlist, false, Encoding.Default);
playList.WriteLine("#EXTM3U");

foreach (string track in tracks)
{
    // Read ID3 tags from file
    var info = new FileProperties(track);

    // Write extended info (#EXTINF:<time>,<artist> - <title>)
    if (Encoding.UTF8.GetBytes(info.Artist).Length != info.Artist.Length ||
        Encoding.UTF8.GetBytes(info.Title).Length != info.Title.Length)
    {
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.UTF8);

        playList.WriteLine(string.Format("#EXTINFUTF8:{0},{1} - {2}",
                           info.Duration, info.Artist, info.Title));

        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.Default);
    }

    playList.WriteLine(string.Format("#EXTINF:{0},{1} - {2}",
                       info.Duration, info.Artist, info.Title));

    // Write the name of the file (removing the drive letter)
    string file = Path.GetFileName(track);
    if (Encoding.UTF8.GetBytes(file).Length != file.Length)
    {
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.UTF8);

        playList.WriteLine(string.Format("#UTF8:{0}", file));

        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.Default);
    }

    playList.WriteLine(file);
}

playList.Close();

As you can see I assume I'm not going to have to write UTF8, but when I do I close the stream and reopen it with UTF8 encoding. I then, after writing the offending line, close and reopen it with the default encoding.
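
The length comparison I use to decide whether a line needs the UTF8 treatment can be looked at on its own. A minimal sketch, with a hypothetical helper name `NeedsUtf8Line`; `GetByteCount` gives the same number as `GetBytes(...).Length` without building the array:

```csharp
using System;
using System.Text;

static class Utf8Check
{
    // True when the UTF-8 form is longer than the string itself, i.e. when
    // the text contains at least one character outside the ASCII range.
    public static bool NeedsUtf8Line(string text)
    {
        return Encoding.UTF8.GetByteCount(text) != text.Length;
    }

    static void Main()
    {
        Console.WriteLine(NeedsUtf8Line("Midi"));          // False
        Console.WriteLine(NeedsUtf8Line("\u00C9t\u00E9")); // True ("Été")
    }
}
```

Note that this fires for any non-ASCII character, including ones like "É" that the default code page can also represent.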

Now I don't know why my previous code gave inconsistent results. Given what everyone (particularly Jon) said it should have failed all the time, or possibly worked all of the time.

ChrisF