I'm sure this is something very simple that I'm screwing up, but here goes:

I'm trying to parse a log file that is generally formatted in Unicode (I'll freely admit that I don't know much about Unicode, but the first two bytes of the file are 0xFFFE, and there's a zero byte after every character). The peculiar part is that this file appears to end lines with the byte sequence 0x0D000D0A, that is, \r\0\r\n, and that is apparently preventing my TextReader from reading it correctly.

That is, every other line I print is filled with:

?????????????????? ???????????? ?      ?????????  ? ?????????????  ? ?????????????? ???? ??? ????? ???????????????????? ??? ???????????? ????????????????? ?????????????????????? ???????????????????? ?????? ????????????????????? ????????????? ?????

What is the recommended way for me to go about parsing this using C#? Or rather, what am I doing wrong?

Thanks!

Update: Sorry, I should have probably included the code I was using in my initial posting. Here it is:

FileStream fsa = File.Open(@"C:\InboxLOG.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
TextReader sr = new StreamReader(fsa, Encoding.Unicode, true);
string line = "";
while ((line = sr.ReadLine()) != null)
{
    Console.WriteLine(line);
}

Using StreamReader(fsa) produces the same results.

+2  A: 

Please try this:

StreamReader reader = new StreamReader(filePath, System.Text.Encoding.Unicode, true);

This looks like UTF-16 encoding; 0xFFFE is the byte order mark (the bytes FF FE indicate little-endian):

http://en.wikipedia.org/wiki/Byte%5Forder%5Fmark

S.Mark
+2  A: 

Hmmm... 0x0D000D0A?

Your line endings do indeed look borked. You might have to parse the file more manually via a Stream... I would have expected 0x0D000A00 (since this is little-endian). I wonder if a non-Unicode process has done a "replace lf with crlf" sweep and mucked it up. You could of course do something similar yourself: processing bytes in blocks of two, starting on even offsets only, replace 0D0A with 0A00. But starting with non-corrupt data is always the better option...
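A minimal sketch of that pairwise replacement, assuming the bytes really are laid out as the question describes (0D 00 0D 0A, with everything after it still aligned on even offsets). The class and method names are made up for illustration. If the stray byte shifts the alignment of the following text, this approach alone would not be enough, and a genuine U+0A0D code unit would be rewritten as well.

```csharp
using System;
using System.Text;

static class LineEndingRepair
{
    // Walks the buffer two bytes at a time, starting on an even offset,
    // and rewrites each aligned pair 0D 0A as 0A 00 (UTF-16LE for '\n').
    // Works on a copy so the caller's buffer is left untouched.
    public static byte[] FixLineEndings(byte[] data)
    {
        byte[] result = (byte[])data.Clone();
        for (int i = 0; i + 1 < result.Length; i += 2)
        {
            if (result[i] == 0x0D && result[i + 1] == 0x0A)
            {
                result[i] = 0x0A;
                result[i + 1] = 0x00;
            }
        }
        return result;
    }
}
```

The repaired buffer could then be wrapped in a MemoryStream and handed to new StreamReader(stream, Encoding.Unicode, true), as in the question. This version holds the whole file in memory; for a large log, the same loop could be applied to even-sized chunks read from the FileStream instead.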


was:

0xFFFE is a BOM, so anything involving StreamReader etc. (such as File.OpenText) should handle this automatically and choose the right encoding. If not, give it a clue:

using(var reader = new StreamReader(path, Encoding.Unicode)) {
    ...
}
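
To check the automatic detection for yourself, here is a small sketch (the class and method names are hypothetical). It decodes a byte buffer with a StreamReader that is not told the encoding, then reports what the reader settled on. The StreamReader(Stream) constructor starts out as UTF-8 with byte order mark detection enabled, and CurrentEncoding is only meaningful after the first read.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDetectionDemo
{
    // Reads everything from the buffer without naming an encoding,
    // then exposes the encoding the reader detected from the BOM.
    public static string ReadAll(byte[] data, out Encoding detected)
    {
        using (var reader = new StreamReader(new MemoryStream(data)))
        {
            string text = reader.ReadToEnd();
            detected = reader.CurrentEncoding; // updated after the first read
            return text;
        }
    }
}
```

For a buffer that starts with FF FE followed by UTF-16LE text, detected.CodePage should come back as 1200 (UTF-16 little-endian), and the returned string should not contain the BOM.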
Marc Gravell
Thanks for the suggestion. I updated my question accordingly. When checking out what it was reading from the file using the debugger, it appears that StreamReader was consuming the BOM appropriately. I'm not sure if that helps or not, but just throwing it out there.
mrduclaw
"I wonder if a non-Unicode process has done a 'replace lf with crlf' sweep and mucked it up" sounds like a very good guess. Maybe some internet protocol? FTP (without binary mode)?
Mihai Nita
With regard to the borking of line endings: I'm actually just trying to parse the MSFax Activity Log file on a WinXP Pro box. Since the file is kinda large, I'd rather not have to make a copy of it every time a fax comes in and I need to reparse it. I'll check out manually splitting it. Thanks again! And please continue to suggest stuff.
mrduclaw
+1  A: 

I'm guessing you're actually using a StreamReader, as TextReader is an abstract class.

From your description, your text is in UTF-16, but StreamReader defaults to UTF-8. When you construct your StreamReader, you need to tell it to use UTF-16 instead:

new StreamReader(..., System.Text.Encoding.Unicode);
R Samuel Klatchko