Hello,

I've got a process which attempts to decode different encodings of strings from a binary stream. I get some behavior which does not quite add up in my mind when I step through it. Specifically, what I do is:

  • obtain the maximum number of bytes that would be used to encode a character in the given encoding
  • grab that many bytes from the stream
  • use Encoding.GetCharCount to determine how many characters might have been encoded in those bytes (could be zero, one, or two...)
  • if it's not zero, I use Encoding.GetString to grab the characters out of the byte array
  • I then figure out how many bytes were used to encode the extracted characters and advance the stream index by that amount
  • if the number of decodable bytes turns out to be zero, I advance the index by one byte and try the whole thing again... in this fashion I expect not to miss any decodable characters
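In code, the loop above looks roughly like this (the helper name `TryExtract` and the buffer-based signature are just illustrative - my real version works against the stream):

```csharp
using System;
using System.Text;

static class StreamScanner
{
    // Sketch of the scanning loop described above, for one candidate encoding.
    // Returns the decoded text and advances `index`, or returns null and
    // skips one byte when nothing decodable is found.
    public static string TryExtract(byte[] buffer, ref int index, Encoding encoding)
    {
        // Worst-case byte count for a single character in this encoding.
        int maxBytes = encoding.GetMaxByteCount(1);
        int available = Math.Min(maxBytes, buffer.Length - index);
        byte[] chunk = new byte[available];
        Array.Copy(buffer, index, chunk, 0, available);

        try
        {
            int charCount = encoding.GetCharCount(chunk);
            if (charCount == 0)
            {
                index++;            // nothing decodable: skip one byte and retry
                return null;
            }

            string text = encoding.GetString(chunk);
            // Advance by the bytes actually consumed by the decoded characters.
            index += encoding.GetByteCount(text);
            return text;
        }
        catch (DecoderFallbackException)
        {
            index++;                // undecodable bytes: skip one byte and retry
            return null;
        }
    }
}
```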

BTW, if anyone notices any incorrect assumption made in the above, feel free to say so...

I have my decoders set to throw DecoderFallbackExceptions when they cannot decode a given set of bytes. What confuses me is that sometimes the exception arises when I call GetCharCount and other times it occurs when I call GetString. Is there any reason this should be happening? Is it in fact expected? I would like to be able to reliably check for the presence of printable characters in as few places as possible - currently I'm doing it in several places.
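One idea I'm considering: since both GetCharCount and GetString run the decoder's fallback, funnel every decode through a single helper so the exception can only surface in one place (`TryDecode` is just my name for it):

```csharp
using System;
using System.Text;

static class SafeDecode
{
    // A single choke point for fallback exceptions: both GetCharCount and
    // GetString invoke the decoder fallback, so every decode goes through here.
    public static bool TryDecode(Encoding encoding, byte[] bytes, out string text)
    {
        try
        {
            text = encoding.GetString(bytes);   // throws on undecodable input
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = null;
            return false;
        }
    }
}
```

This assumes the encoding was built with an exception fallback, e.g. `Encoding.GetEncoding("utf-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback)`.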

Any thoughts?

thanks, brian

BIG UPDATE: It seems that my initial description of the problem is lacking a bit. Let me add a few more premises to the problem:

  • the stream could be extremely large - it will not fit in memory for most users
  • at any given place in the stream, I don't know for sure whether I am at the beginning of text or in the middle of it
  • at any given place in the stream, I don't know if I am at the beginning or in the middle of a multi-byte character
  • the stream will contain much material that is in fact not text of any sort, as well as a smattering of different encodings

Hopefully this clarifies some of the issues. Responses so far have been very helpful! Please do continue!

+2  A: 

Encodings like UTF-8 use a variable number of bytes per character, so you can't simply multiply by the maximum byte count to decide how much to fetch from the stream. The last byte might fall in the middle of a character (and be invalid in isolation), or the trailing bytes might decode to a completely different character in that specific encoding.
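A quick way to see this, assuming UTF-8 with an exception fallback: "é" encodes to two bytes, and the lead byte on its own is invalid.

```csharp
using System;
using System.Text;

class SplitCharDemo
{
    static void Main()
    {
        // UTF-8 configured so that truncated sequences throw instead of
        // silently becoming U+FFFD replacement characters.
        Encoding strictUtf8 = Encoding.GetEncoding(
            "utf-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        byte[] eAcute = strictUtf8.GetBytes("\u00E9");   // 0xC3 0xA9: two bytes
        Console.WriteLine(eAcute.Length);                 // 2

        byte[] truncated = { eAcute[0] };                 // lead byte only
        try
        {
            strictUtf8.GetCharCount(truncated);           // incomplete sequence
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("truncated sequence is invalid in isolation");
        }
    }
}
```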

Mehrdad Afshari
Excellent point - I think I have accounted for this problem, though I did not mention any of that in my original post... I'll elaborate.
sweeney
Hmm, now that I think about it, it seems possible that I could erroneously decode material that looked like text but was in fact simply coincidence, at which point I would be at an incorrect offset, right? Is there any way around this?
sweeney
Probably. It could happen depending on the encoding. To do this safely, you should prefix them with the number of bytes and the encoding. What if the bytes looked like valid text in another encoding too and you tried that one first?
Mehrdad Afshari
First of all, I'm not the one performing the initial encoding, so I must assume that the data is more or less intact when I get to it. Second, I could try different encodings on the same chunks of bytes, but then what do I do - see which one decodes the most characters?
sweeney
I mentioned another problem: what if the byte sequence was valid in two different encodings? Which one would you choose?
Mehrdad Afshari
Right, that's what I'm not sure about... I suppose you could continue trying more and more bytes and finally assume the encoding that seemed to be working out best. I think that leaves a lot of room for error, and it would involve quite a bit of backtracking through the byte stream... Not sure how to deal with this...
sweeney
Indeed, there's always room for error. Web browsers, for example, use heuristics to detect encodings, but personally I've seen detection go wrong and show garbage on a Web page many times, and I've had to set the page's encoding manually. In fact, an encoding is like a contract: you should have agreed upon one before communicating. Otherwise it comes down to guessing (which is not easy at all - you should use a library for that), and even then there will be room for error.
Mehrdad Afshari
Alright, it looks like I'm heading in that direction as per arbiter's contributions below. thanks for the help!
sweeney
+1  A: 

Wow. Sounds like mighty overkill. Have you tried using the GetDecoder method of your encoding? It hands you a Decoder with a GetChars method that you feed a byte array and a char array to and it fills the char array with the available characters decoded from the byte array.

If there's any overshoot (i.e. spare bytes) these are saved in the state of the decoder for the next time that you call GetChars with fresh bytes.

You can use a StringBuilder to assemble the result.

A little simpler than your method.
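A minimal sketch of what I mean, assuming UTF-8 and two reads that split a multi-byte character across the boundary:

```csharp
using System;
using System.Text;

class DecoderStreamingDemo
{
    static void Main()
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        StringBuilder result = new StringBuilder();

        // Pretend these are two reads from a stream; the two-byte character
        // 0xC3 0xA9 ("é") is split across the read boundary.
        byte[][] reads = { new byte[] { 0x61, 0xC3 }, new byte[] { 0xA9, 0x62 } };

        foreach (byte[] chunk in reads)
        {
            char[] chars = new char[decoder.GetCharCount(chunk, 0, chunk.Length)];
            decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            result.Append(chars);   // the dangling 0xC3 is held in decoder state
        }

        Console.WriteLine(result.ToString());   // "aéb"
    }
}
```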

spender
Unfortunately, I don't believe so. I don't know where the text begins, nor do I know the actual encoding at any given spot. I think this prevents me from using GetDecoder, right?
sweeney
+1  A: 

If I understand your question correctly, you are trying to read char data from a byte stream with an unknown encoding?

If my assumption is right, then you first need to detect the encoding, and then read the byte stream using a TextReader with that encoding. Then you will not need to worry about different char sizes - TextReader will do all the work for you.

I know two approaches to detecting the encoding of a byte stream:

  1. Ude is a C# port of Mozilla Universal Charset Detector.
  2. IE MultiLang services
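A rough sketch of the detect-then-read approach with Ude (I haven't verified the details here; the sample size and file name are arbitrary):

```csharp
using System;
using System.IO;
using System.Text;
using Ude;   // C# port of Mozilla Universal Charset Detector

class DetectThenRead
{
    static void Main()
    {
        using (FileStream fs = File.OpenRead("input.bin"))
        {
            // Let the detector sample the start of the stream.
            CharsetDetector detector = new CharsetDetector();
            byte[] sample = new byte[4096];
            int read = fs.Read(sample, 0, sample.Length);
            detector.Feed(sample, 0, read);
            detector.DataEnd();

            if (detector.Charset == null)
            {
                Console.WriteLine("no encoding detected");
                return;
            }

            // Rewind and let a TextReader handle character boundaries.
            fs.Seek(0, SeekOrigin.Begin);
            Encoding encoding = Encoding.GetEncoding(detector.Charset);
            using (TextReader reader = new StreamReader(fs, encoding))
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }
}
```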
arbiter
Your assumption is indeed correct. I'm going to have a look at the libraries you've posted as soon as I get a chance. Might be a few days, but if it works I'll check back and update the thread.
sweeney
Actually, I haven't used Ude myself, because I only found it a month ago. But MultiLang has worked very well for me. However, I suggest trying Ude first because it is a fully managed solution.
arbiter
Just tried Ude using the sample code provided on the site and a very simple .txt file as the input stream. It cannot figure out the encoding. I also tried a .docx file, with the same lack of results...
sweeney
It seems to work only for websites (i.e. open a site, download the page, open it as a FileStream, and use that as the input). While this is cool, it still does not solve the problem. Definitely worth checking out, though.
sweeney