ansaurus

Question

Answer 1

+2 A:

Please try this:

StreamReader reader = new StreamReader(filePath, System.Text.Encoding.Unicode, true);

It looks like UTF-16 encoding; the bytes FF FE are the UTF-16 little-endian byte order mark:

http://en.wikipedia.org/wiki/Byte%5Forder%5Fmark

S.Mark 2009-11-26 07:18:35

Answer 2

+2 A:

Hmmm... 0x0D000D0A?

Your line endings do indeed look borked. You might have to parse the file more manually via a Stream... I would have expected 0x0D000A00 (since this is little-endian). I wonder if a non-Unicode process has done a "replace LF with CRLF" sweep and mucked it up. You could of course do a similar sweep yourself: processing the bytes in blocks of two (starting on even offsets only), replace 0D0A with 0A00. But starting with non-corrupt data is always a better option...
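A minimal sketch of that byte-pair sweep, assuming the damage is exactly the 0D 00 0D 0A pattern described and that the rest of the file is still aligned on two-byte boundaries. The names NewlineRepair, RepairLineEndings and OpenRepaired are hypothetical, and it reads the whole file into memory, which may not suit a large log:

```csharp
using System;
using System.IO;
using System.Text;

static class NewlineRepair
{
    // Hypothetical helper: walks a copy of the buffer two bytes at a time
    // (even offsets only) and rewrites each 0D 0A pair as 0A 00, which is
    // a line feed in UTF-16 little-endian.
    // Assumption: no legitimate character in the file is encoded as 0D 0A.
    public static byte[] RepairLineEndings(byte[] data)
    {
        var repaired = (byte[])data.Clone();
        for (int i = 0; i + 1 < repaired.Length; i += 2)
        {
            if (repaired[i] == 0x0D && repaired[i + 1] == 0x0A)
            {
                repaired[i] = 0x0A;
                repaired[i + 1] = 0x00;
            }
        }
        return repaired;
    }

    // Hypothetical helper: loads the file, repairs it in memory, and wraps
    // the result in a StreamReader that decodes it as UTF-16LE.
    public static StreamReader OpenRepaired(string path)
    {
        byte[] repaired = RepairLineEndings(File.ReadAllBytes(path));
        return new StreamReader(new MemoryStream(repaired), Encoding.Unicode);
    }
}
```

With this in place, the reader returned by NewlineRepair.OpenRepaired(path) can be read line by line with ReadLine(), just like one opened directly on the file.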

was:

0xFFFE is a BOM, so anything involving StreamReader etc. (such as File.OpenText) should handle this automatically and choose the right encoding. If not, give it a clue:

using(var reader = new StreamReader(path, Encoding.Unicode)) {
    ...
}

Marc Gravell 2009-11-26 07:19:07

Thanks for the suggestion. I updated my question accordingly. When checking out what it was reading from the file using the debugger, it appears that StreamReader was consuming the BOM appropriately. I'm not sure if that helps or not, but just throwing it out there.

mrduclaw 2009-11-26 07:30:11

"I wonder if a non-Unicode process has done a "replace lf with crlf" sweep and mucked it up" sounds like a very good guess. Maybe some internet protocol? FTP (without binary mode)?

Mihai Nita 2009-11-26 07:54:00

With regard to the borking of line endings: I'm actually just trying to parse the MSFax Activity Log file on a WinXP Pro box. Since the file is kinda large, I'd rather not have to make a copy of it every time a fax comes in and I need to reparse it. I'll check out manually splitting it. Thanks again! And please continue to suggest stuff.

mrduclaw 2009-11-26 08:05:26

Answer 3

+1 A:

I'm guessing you're actually using a StreamReader, as TextReader is an abstract class.

From your description, your text is in UTF-16, but StreamReader defaults to UTF-8. When you construct your StreamReader, you need to tell it to use UTF-16 instead:

new StreamReader(..., System.Text.Encoding.Unicode);

R Samuel Klatchko 2009-11-26 07:23:27

Parsing Peculiar Newlines