Hello,

I've got a process which attempts to decode different encodings of strings from a binary stream. I get some behavior which does not quite add up in my mind when I step through it. Specifically, what I do is:

  • obtain the maximum number of bytes that would be used to encode a character in the given encoding
  • grab that many bytes from the stream
  • use Encoding.GetCharCount to determine how many characters might have been encoded in those bytes (could be zero, one, or two...)
  • if it's not zero, I use Encoding.GetString to grab the characters out of the byte array
  • I then figure out how many bytes were used to encode the extracted characters and advance the stream index by that amount
  • if the number of decodable bytes turns out to be zero, I advance the index by one byte and try the whole thing again... in this fashion I expect not to miss any decodable characters
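In code, the loop above looks roughly like this (the helper name `TryExtract` and the buffer-based signature are just illustrative - my real version works against the stream):

```csharp
using System;
using System.Text;

static class StreamScanner
{
    // Sketch of the scanning loop described above, for one candidate encoding.
    // Returns the decoded text and advances `index`, or returns null and
    // skips one byte when nothing decodable is found.
    public static string TryExtract(byte[] buffer, ref int index, Encoding encoding)
    {
        // Worst-case byte count for a single character in this encoding.
        int maxBytes = encoding.GetMaxByteCount(1);
        int available = Math.Min(maxBytes, buffer.Length - index);
        byte[] chunk = new byte[available];
        Array.Copy(buffer, index, chunk, 0, available);

        try
        {
            int charCount = encoding.GetCharCount(chunk);
            if (charCount == 0)
            {
                index++;            // nothing decodable: skip one byte and retry
                return null;
            }

            string text = encoding.GetString(chunk);
            // Advance by the bytes actually consumed by the decoded characters.
            index += encoding.GetByteCount(text);
            return text;
        }
        catch (DecoderFallbackException)
        {
            index++;                // undecodable bytes: skip one byte and retry
            return null;
        }
    }
}
```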

BTW, if anyone notices any incorrect assumption made in the above, feel free to say so...

I have my decoders set to throw DecoderFallbackExceptions when they cannot decode a given set of bytes. What confuses me is that sometimes the exception arises when I call GetCharCount and other times it occurs when I call GetString. Is there any reason this should be happening? Is it in fact expected? I would like to be able to reliably check for the presence of printable characters in as few places as possible - currently I'm doing it in several places.
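One idea I'm considering: since both GetCharCount and GetString run the decoder's fallback, funnel every decode through a single helper so the exception can only surface in one place (`TryDecode` is just my name for it):

```csharp
using System;
using System.Text;

static class SafeDecode
{
    // A single choke point for fallback exceptions: both GetCharCount and
    // GetString invoke the decoder fallback, so every decode goes through here.
    public static bool TryDecode(Encoding encoding, byte[] bytes, out string text)
    {
        try
        {
            text = encoding.GetString(bytes);   // throws on undecodable input
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = null;
            return false;
        }
    }
}
```

This assumes the encoding was built with an exception fallback, e.g. `Encoding.GetEncoding("utf-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback)`.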

Any thoughts?

thanks, brian

BIG UPDATE: It seems that my initial description of the problem is lacking a bit. Let me add a few more premises to the problem:

  • the stream could be extremely large - it will not fit in memory for most users
  • at any given place in the stream, I don't know for sure whether I am at the beginning of text or in the middle of it
  • at any given place in the stream, I don't know if I am at the beginning or in the middle of a multi-byte character
  • the stream will contain much material that is in fact not text of any sort, as well as a smattering of different encodings

Hopefully this clarifies some of the issues. Responses so far have been very helpful! Please do continue!

+2  A: 

Encodings like UTF-8 use a variable number of bytes per character, so you can't simply multiply by the maximum byte count to decide how much to fetch from the stream. The last byte might fall in the middle of a character (and be invalid in isolation), or the trailing bytes might decode to a completely different character in that specific encoding.
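A quick way to see this, assuming UTF-8 with an exception fallback: "é" encodes to two bytes, and the lead byte on its own is invalid.

```csharp
using System;
using System.Text;

class SplitCharDemo
{
    static void Main()
    {
        // UTF-8 configured so that truncated sequences throw instead of
        // silently becoming U+FFFD replacement characters.
        Encoding strictUtf8 = Encoding.GetEncoding(
            "utf-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        byte[] eAcute = strictUtf8.GetBytes("\u00E9");   // 0xC3 0xA9: two bytes
        Console.WriteLine(eAcute.Length);                 // 2

        byte[] truncated = { eAcute[0] };                 // lead byte only
        try
        {
            strictUtf8.GetCharCount(truncated);           // incomplete sequence
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("truncated sequence is invalid in isolation");
        }
    }
}
```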

Mehrdad Afshari
Excellent point - I think I have accounted for this problem, though I did not mention any of that in my original post... I'll elaborate.
sweeney
Hmm, now that I think about it, it seems possible that I could erroneously decode material that looked like text but was in fact simply coincidence, at which point I would be at an incorrect offset, right? Is there any way around this?
sweeney
Probably. It could happen depending on the encoding. To do this safely, you should prefix them with the number of bytes and the encoding. What if the bytes looked like valid text in another encoding too and you tried that one first?
Mehrdad Afshari
First of all, I'm not the one performing the initial encoding, so I must assume that the data is more or less intact when I get to it. Second, I could try different encodings on the same chunks of bytes, but then what do I do - see which one decodes the most characters?
sweeney
I mentioned another problem: what if the byte sequence was valid in two different encodings? Which one would you choose?
Mehrdad Afshari
Right, that's what I'm not sure about... I suppose you could continue trying more and more bytes and finally assume the encoding that seemed to be working out best. I think that leaves a lot of room for error, and it would involve quite a bit of backtracking through the byte stream... Not sure how to deal with this...
sweeney
Indeed, there's always room for error. Web browsers, for example, use heuristics to detect encodings, but personally I've seen detection go wrong and show garbage on a Web page many times, and I've had to set the page's encoding manually. In fact, an encoding is like a contract: you should have agreed upon one before communicating. Otherwise it comes down to guessing (which is not easy at all - you should use a library for that), and even then there will be room for error.
Mehrdad Afshari
Alright, it looks like I'm heading in that direction as per arbiter's contributions below. thanks for the help!
sweeney
+1  A: 

Wow. Sounds like mighty overkill. Have you tried using the GetDecoder method of your encoding? It hands you a Decoder with a GetChars method that you feed a byte array and a char array to and it fills the char array with the available characters decoded from the byte array.

If there's any overshoot (i.e. spare bytes) these are saved in the state of the decoder for the next time that you call GetChars with fresh bytes.

You can use a StringBuilder to assemble the result.

A little simpler than your method.
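A minimal sketch of what I mean, assuming UTF-8 and two reads that split a multi-byte character across the boundary:

```csharp
using System;
using System.Text;

class DecoderStreamingDemo
{
    static void Main()
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        StringBuilder result = new StringBuilder();

        // Pretend these are two reads from a stream; the two-byte character
        // 0xC3 0xA9 ("é") is split across the read boundary.
        byte[][] reads = { new byte[] { 0x61, 0xC3 }, new byte[] { 0xA9, 0x62 } };

        foreach (byte[] chunk in reads)
        {
            char[] chars = new char[decoder.GetCharCount(chunk, 0, chunk.Length)];
            decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            result.Append(chars);   // the dangling 0xC3 is held in decoder state
        }

        Console.WriteLine(result.ToString());   // "aéb"
    }
}
```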

spender
Unfortunately, I don't believe so. I don't know where the text begins, nor do I know the actual encoding at any given spot. I think this prevents me from using GetDecoder, right?
sweeney
+1  A: 

If I understand your question correctly, you are trying to read char data from a byte stream with an unknown encoding?

If my assumption is right, then you first need to detect the encoding, and then read the byte stream using a TextReader with that encoding. Then you will not need to worry about different char sizes - TextReader will do all the work for you.

I know two approaches to detecting the encoding of a byte stream:

  1. Ude is a C# port of Mozilla Universal Charset Detector.
  2. IE MultiLang services
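A rough sketch of the detect-then-read approach with Ude (I haven't verified the details here; the sample size and file name are arbitrary):

```csharp
using System;
using System.IO;
using System.Text;
using Ude;   // C# port of Mozilla Universal Charset Detector

class DetectThenRead
{
    static void Main()
    {
        using (FileStream fs = File.OpenRead("input.bin"))
        {
            // Let the detector sample the start of the stream.
            CharsetDetector detector = new CharsetDetector();
            byte[] sample = new byte[4096];
            int read = fs.Read(sample, 0, sample.Length);
            detector.Feed(sample, 0, read);
            detector.DataEnd();

            if (detector.Charset == null)
            {
                Console.WriteLine("no encoding detected");
                return;
            }

            // Rewind and let a TextReader handle character boundaries.
            fs.Seek(0, SeekOrigin.Begin);
            Encoding encoding = Encoding.GetEncoding(detector.Charset);
            using (TextReader reader = new StreamReader(fs, encoding))
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }
}
```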
arbiter
Your assumption is indeed correct. I'm going to have a look at the libraries you've posted as soon as I get a chance. Might be a few days, but if it works I'll check back and update the thread.
sweeney
Actually, I haven't used Ude myself, because I only found it a month ago. But MultiLang has worked very well for me. However, I suggest trying Ude first because it is a fully managed solution.
arbiter
Just tried Ude using the sample code provided on the site and a very simple .txt file as the input stream. It cannot figure out the encoding. I also tried a .docx file, with the same lack of results...
sweeney
It seems to work only for websites (i.e. open a site, download the page, open it as a FileStream, and use that as the input). While this is cool, it still does not solve the problem. Definitely worth checking out, though.
sweeney