character based file stream in .NET

I need to modify a text file of unknown encoding: I need to insert some text after the first occurrence of a predefined string (e.g. "#markx#"). Is there a class in .NET that allows me to randomly access the content of a file based on characters (as opposed to bytes)? Since the Stream.Seek methods work on a byte basis, I would not only need to know the encoding but also whether there are special control bytes (such as the byte order mark at the beginning of a Unicode file). I would love to have a class that abstracts all this away and allows me to say: seek to the 25th character and add some string there, just as a text editor would do it.

You can't know what each character is without knowing what encoding the file is using.

You can loop through all encodings and try them one by one, or guess at the encoding.
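A rough sketch of the "try them one by one" idea, written in Python only to keep it short (the function name and the candidate list are made up for illustration). Note the weakness: single-byte code pages accept almost any byte sequence, so a successful decode is a hint, not proof.

```python
def guess_encoding(data, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes `data` without error, else None.

    Hypothetical helper: the order of `candidates` matters, because permissive
    single-byte encodings will "succeed" on nearly any input.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid byte sequences
        except UnicodeError:
            continue
        return name
    return None


print(guess_encoding("héllo".encode("utf-8")))  # valid UTF-8, so "utf-8" wins
print(guess_encoding(b"\xe9t\xe9"))             # invalid UTF-8, falls through to "cp1252"
```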

Oded 2009-12-18 11:39:54

Have a look at TextReader and TextWriter.

Miroslav Bajtoš 2009-12-18 11:41:04

-1 this is not helpful. It is too vague, and in any case TextReader, in and of itself, offers no solution to the encoding requirements that the questioner raises.

Rob Levine 2009-12-18 11:49:18

+1 A:

Given that characters can take a variable number of bytes this would be pretty tough to do without converting the bytes to characters with a TextReader.

You could wrap up a TextReader and give it a Seek method that ensures enough characters have been loaded to satisfy each request.
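Something along these lines, sketched in Python rather than C# just to show the shape of it. The class `CharSeekReader` is hypothetical, and it keeps every decoded character in memory, which is only reasonable for small files. It wraps any object with a character-based `read(count)` method, which is the role a TextReader would play in .NET.

```python
import io


class CharSeekReader:
    """Hypothetical wrapper giving character-indexed seek over a sequential text reader."""

    def __init__(self, reader):
        self._reader = reader  # anything with read(count) returning a str
        self._chars = []       # every character decoded so far
        self._pos = 0          # current position, counted in characters

    def _fill(self, count):
        # Keep reading until `count` characters are buffered or the input ends.
        while len(self._chars) < count:
            chunk = self._reader.read(count - len(self._chars))
            if not chunk:
                break
            self._chars.extend(chunk)

    def seek(self, char_index):
        # Load enough characters to satisfy the request, then move there.
        self._fill(char_index)
        self._pos = min(char_index, len(self._chars))
        return self._pos

    def read(self, count):
        self._fill(self._pos + count)
        text = "".join(self._chars[self._pos:self._pos + count])
        self._pos += len(text)
        return text


raw = io.BytesIO("héllo wörld".encode("utf-8"))
reader = CharSeekReader(io.TextIOWrapper(raw, encoding="utf-8"))
reader.seek(6)
print(reader.read(5))  # wörld
```

The wrapper never touches byte offsets; all the variable-width bookkeeping stays inside the underlying decoder.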

GraemeF 2009-12-18 11:42:17

A layer of abstraction over the standard stream Seek would have to read each character in turn from the file. By default .NET assumes files are UTF-8, so any file that doesn't start with a BOM is treated as UTF-8.

UTF-8 has variable-size characters, so you can't know how many bytes a character takes up until you read its first byte.

Therefore, you have to sequentially access each byte in the file to know where each character starts/ends.

In conclusion, if you know the file is ASCII or UTF-32, you can do this because you know the size of each character. The same goes for UTF-16 as long as the text contains no surrogate pairs (as far as I know; if I'm wrong, please correct me).

If it's UTF-8 you can't "seek" to a character.
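To illustrate (in Python, since the point is not specific to .NET; the helper `byte_offset_of_char` is made up for this example): the only way to find the byte offset of the Nth character is to walk the bytes from the start and count characters as they complete.

```python
import codecs

# The number of bytes per character in UTF-8 varies:
# these four characters take 1, 2, 3 and 4 bytes respectively.
for ch in "aé€😀":
    print(repr(ch), len(ch.encode("utf-8")))


def byte_offset_of_char(data, char_index, encoding="utf-8"):
    """Return the byte offset at which character number `char_index` starts."""
    decoder = codecs.getincrementaldecoder(encoding)()
    chars_seen = 0
    for offset in range(len(data)):
        if chars_seen == char_index:
            return offset
        # Feed one byte; the decoder returns "" until a character is complete.
        chars_seen += len(decoder.decode(data[offset:offset + 1]))
    if chars_seen == char_index:
        return len(data)  # position just past the last character
    raise IndexError("file has fewer characters than requested")


data = "aé€b".encode("utf-8")
print(byte_offset_of_char(data, 3))  # 6, because 'a', 'é', '€' occupy 1 + 2 + 3 bytes
```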

Hope this helps,

Binary Worrier 2009-12-18 11:46:58

+1 A:

You can use a StreamReader to go through the file one character at a time. It has no Seek method, but you can still read character by character and so effectively implement your own seek.

With regard to encodings - you will need to have identified the encoding in order to use the StreamReader.

However, the StreamReader itself can help if you create it with one of the constructor overloads that allows you to supply the flag detectEncodingFromByteOrderMarks as true (or you can use Encoding.GetPreamble and look at the byte preamble yourself).

Both these methods will only help auto-detect UTF-based encodings, though, so a file in an ANSI encoding with a specific code page will probably not be decoded correctly.
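For what it's worth, here is the preamble approach applied to the original task, as a Python sketch rather than .NET code. The names `sniff_bom` and `insert_after_marker` and the UTF-8 fallback are my own assumptions. It works on the whole file content in memory, and it raises a decode error if there is no BOM and the fallback guess is wrong.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data, fallback="utf-8"):
    """Return (encoding name, BOM length); the fallback is a guess, not a detection."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return fallback, 0


def insert_after_marker(data, marker, addition, fallback="utf-8"):
    """Insert `addition` after the first `marker`, keeping the BOM and encoding as found."""
    encoding, bom_len = sniff_bom(data, fallback)
    text = data[bom_len:].decode(encoding)
    index = text.find(marker)
    if index < 0:
        return data  # marker not present, leave the content untouched
    cut = index + len(marker)
    text = text[:cut] + addition + text[cut:]
    return data[:bom_len] + text.encode(encoding)


original = codecs.BOM_UTF16_LE + "x #markx# y".encode("utf-16-le")
updated = insert_after_marker(original, "#markx#", "NEW")
print(updated[2:].decode("utf-16-le"))  # x #markx#NEW y
```

Searching the decoded text sidesteps the seek problem entirely: the marker is found by character index, and the bytes are only regenerated once at the end.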

Rob Levine 2009-12-18 11:47:45
