tags:
views: 431
answers: 3

I'm working with a function that expects a string formatted as a UTF-8 encoded octet string. Can someone give me an example of what a UTF-8 encoded octet string would look like?

Put another way, if I convert 'foo' to bytes, I get 112, 111, 111. What would these char codes look like as a utf-8 encoded octet string? Would it be "0x70 0x6f 0x6f"?

The context of my question is the process of generating an OpenID signature as described in the OpenID spec: "The message MUST be encoded in UTF-8 to produce a byte string." I'm looking for an example of what this would look like.

Thanks

+2  A: 

No. UTF-8 characters can span multiple bytes. If you want to learn about UTF-8, you should start with its article on Wikipedia, which has a good description.

Billy ONeal
Thanks. The Wikipedia article was one of the first places I looked. I think I may just be hung up on terminology: if I run a string through PHP's utf8_encode, could the output be described as a UTF-8 encoded byte string? The context of my question is the process of generating an OpenID signature as described in the OpenID spec (http://openid.net/specs/openid-authentication-2_0.html#kvform): "The message MUST be encoded in UTF-8 to produce a byte string." I'm looking for an example of what this would look like.
erik
I'm unfamiliar with PHP's handling of UTF-8. Sorry.
Billy ONeal
utf8_encode(the_string) will give the correct result if the string is encoded in ISO-8859-1.
dan04
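dan04's point about utf8_encode can be sketched in Python (an illustration, not PHP itself): utf8_encode treats its input as ISO-8859-1 bytes and re-encodes them as UTF-8.

```python
# Sketch of what PHP's utf8_encode does, assuming the input bytes
# are ISO-8859-1 (Latin-1): decode as Latin-1, re-encode as UTF-8.
latin1_bytes = b"caf\xe9"                 # 'café' in ISO-8859-1; 0xE9 = 'é'
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
print(utf8_bytes)                         # b'caf\xc3\xa9' - 'é' becomes two bytes
```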
+1  A: 

I think you may have made some mistakes in encoding your example ('f' is 102, not 112), but in any case, my guess is that the answer you really need is that UTF-8 is a superset of ASCII (the standard way to encode characters into bytes).

So, if you pass an ASCII-encoded string to a function that expects a UTF-8 encoded string, it should work just fine.

However, the opposite isn't true at all: UTF-8 can represent many characters that ASCII cannot, so passing a UTF-8 encoded string to a function that expects an ASCII (i.e. 'normal') string is dangerous (unless you're positive that all the characters are part of the ASCII subset).
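As an illustration (a Python sketch, not from the original answer), you can check whether a string stays within the ASCII subset before handing it to ASCII-only code:

```python
# A string is safe to pass to ASCII-only code only if every character
# falls within the 7-bit ASCII range (code points 0-127).
def is_ascii_safe(s: str) -> bool:
    return all(ord(ch) < 128 for ch in s)

print(is_ascii_safe("foo"))    # True - plain ASCII
print(is_ascii_safe("naïve"))  # False - 'ï' is outside ASCII
```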

smehmood
Thanks. I'm trying to describe a need for something I don't fully understand. I'm writing a JavaScript OpenID signature generator. The OpenID spec states that the "message (to be signed) MUST be encoded in UTF-8 to produce a byte string." I'm looking for a language-agnostic example of what such a string would look like. Thanks again.
erik
A: 

The string "foo" gets encoded as 66 6F 6F, but it's like that in nearly all ASCII derivatives. That's one of the biggest features of UTF-8: Backwards compatibility with 7-bit ASCII. If you're only dealing with ASCII, you don't have to do anything special.

Other characters are encoded with up to 4 bytes. Specifically, the bits of the Unicode code point are broken up into one of the patterns:

  • 0xxxxxxx
  • 110xxxxx 10xxxxxx
  • 1110xxxx 10xxxxxx 10xxxxxx
  • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

with the requirement of using the shortest sequence that fits. So, for example, the Euro sign ('€' = U+20AC = binary 10 000010 101100) gets encoded as 1110 0010, 10 000010, 10 101100 = E2 82 AC.

So, it's just a simple matter of going through the Unicode code points in a string and encoding each one in UTF-8.
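The per-code-point procedure above can be sketched in Python (a hand-rolled illustration of the bit patterns; in practice you would just call str.encode("utf-8")):

```python
def utf8_encode_codepoint(cp: int) -> bytes:
    """Encode one Unicode code point using the shortest UTF-8 pattern that fits."""
    if cp < 0x80:                            # 0xxxxxxx
        return bytes([cp])
    elif cp < 0x800:                         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    elif cp < 0x10000:                       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:                                    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

def utf8_encode(s: str) -> bytes:
    # Encode a whole string by encoding each code point in turn.
    return b"".join(utf8_encode_codepoint(ord(ch)) for ch in s)

print(utf8_encode("\u20ac").hex(" "))   # e2 82 ac - the Euro sign example above
print(utf8_encode("foo").hex(" "))      # 66 6f 6f
```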

The hard part is figuring out what encoding your string is in to begin with. Most modern languages (e.g., Java, C#, Python 3.x) have distinct types for "byte array" and "string", where "strings" always have the same internal encoding (UTF-16 or UTF-32), and you have to call an "encode" function if you want to convert it to an array of bytes in a specific encoding.
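In Python 3, for instance (one of the languages mentioned above), the string/byte-array split looks like this:

```python
s = "\u20ac100"              # a str: a sequence of Unicode code points
b = s.encode("utf-8")        # bytes: the UTF-8 encoding of those code points
print(type(s).__name__, type(b).__name__)   # str bytes
print(b)                                    # b'\xe2\x82\xac100'
print(b.decode("utf-8") == s)               # True - decoding round-trips
```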

Unfortunately, older languages like C conflate "characters" and "bytes". (IIRC, PHP is like this too, but it's been a few years since I used it.) And even if your language does support Unicode, you still have to deal with disk files and web pages with unspecified encodings. For more details, search for "chardet".

dan04