I'm trying to store a Gzip-serialized object in one of Active Directory's "Extension Attribute" fields, more info here. This field is a Unicode string according to its oMSyntax of 64.

What is the most efficient way to store a binary blob as Unicode? Once I get this down, the rest is a piece of cake.

+1  A: 

Normally, this would be the way to convert between bytes and Unicode text:

// string from bytes
System.Text.Encoding.Unicode.GetString(bytes);

// bytes from string
System.Text.Encoding.Unicode.GetBytes(text);

EDIT:
But since not every possible byte sequence is a valid Unicode string, you should use a method that can create a string from an arbitrary byte sequence:

// string from bytes
Convert.ToBase64String(byteArray);

// bytes from string
Convert.FromBase64String(base64Encoded);
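As a minimal sketch of the difference between the two approaches, the following program round-trips two byte arrays through `Encoding.Unicode` and through Base64. The class name `RoundTripDemo` and its helper methods are hypothetical names chosen for illustration; only the `Encoding` and `Convert` calls come from the answer above.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTripDemo
{
    // True if decoding the bytes as UTF-16 and encoding the result again
    // reproduces the original bytes exactly.
    public static bool SurvivesUtf16(byte[] data)
    {
        string text = Encoding.Unicode.GetString(data);
        return Encoding.Unicode.GetBytes(text).SequenceEqual(data);
    }

    // Same check, but using Base64 as the intermediate string form.
    public static bool SurvivesBase64(byte[] data)
    {
        string text = Convert.ToBase64String(data);
        return Convert.FromBase64String(text).SequenceEqual(data);
    }

    public static void Main()
    {
        // Three bytes: the last byte is only half of a UTF-16 code unit.
        byte[] oddLength = { 0x41, 0x00, 0x42 };
        // Little-endian U+D800: a high surrogate with no low surrogate after it.
        byte[] loneSurrogate = { 0x00, 0xD8 };

        Console.WriteLine("UTF-16, odd length:     " + SurvivesUtf16(oddLength));
        Console.WriteLine("UTF-16, lone surrogate: " + SurvivesUtf16(loneSurrogate));
        Console.WriteLine("Base64, odd length:     " + SurvivesBase64(oddLength));
        Console.WriteLine("Base64, lone surrogate: " + SurvivesBase64(loneSurrogate));
    }
}
```

With the default `Encoding.Unicode` instance, invalid input is replaced rather than rejected, so the UTF-16 checks should fail for both arrays while the Base64 checks succeed.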

(Thanks to @Timwi who pointed this out!)

Venemo
Thanks! I'm trying to keep my brain sharp while I'm on painkillers from my motorcycle injury. I *think* I should have known this. Just perfect.
MakerOfThings7
@MakerOfThings7 - Don't worry, it was unknown to me too until I actually needed it. :)
Venemo
**This answer is completely wrong.** If you use this, you will lose data. `Encoding.Unicode` encapsulates UTF-16, and not all byte arrays are valid UTF-16. Consider arrays with odd numbers of bytes, or byte sequences with lone surrogates, for example. Neither is valid UTF-16, and either would generate a string that doesn’t turn back into the original byte array.
Timwi
@Timwi - Thanks, I did not think about it. If ASCII encoding would fix it, wouldn't it be also a solution to use `System.Text.Encoding.ASCII`? In theory, every byte has an ASCII counterpart, right?
Venemo
@Venemo: No, of course not — half of all bytes are not valid ASCII characters! The encodings in `System.` **`Text`** `.Encoding` are meant to encode **text** as the name implies. You should use an encoding that is *designed for arbitrary byte data*. Base64 is an example of that.
Timwi
@Timwi - When looking at an ASCII code table, there seem to be 256 possible values for an ASCII character, which correspond to the number of possible `byte` values. Anyways, I edited my answer and corrected it. I am ashamed of making such a mistake. :(
Venemo
@Venemo: Then you are looking at a code table that **doesn’t represent ASCII**. Just run `Encoding.ASCII.GetString(new byte[] { 63 })` and then `Encoding.ASCII.GetString(new byte[] { 129 })` (hint: you get the same answer for both). You are looking at one that perhaps represents Latin-1 (ISO-8859-1) or Windows-1252. However, *even in those not all 256 possible values have a valid character*. The non-Unicode encodings turn several possible byte values into question marks.
Timwi
@Timwi - I was looking at http://www.ascii-code.com/ - According to this, both `?` and `á` are valid ASCII characters. Anyways, as I said, I'm glad you pointed out this mistake! :)
Venemo
@Venemo: Well that website is wrong. It shows the [Windows-1252](http://en.wikipedia.org/wiki/Windows-1252) character set, not [ASCII](http://en.wikipedia.org/wiki/ASCII).
Timwi
@Timwi - Good to know. Thanks!
Venemo
+3  A: 

There are, of course, many ways of reliably packing an arbitrary byte array into Unicode characters, but none of them are very efficient. It is very unfortunate that Active Directory would choose to use Unicode for data that is not textual in nature. It’s like using a string to represent a 32-bit integer, or like using Nutella to write a love letter.

My recommendation would be to “play it safe” and use an ASCII-based encoding such as base64. I recommend this because there is already a built-in .NET implementation for it:

var base64Encoded = Convert.ToBase64String(byteArray);

var original = Convert.FromBase64String(base64Encoded);
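To put a number on the cost of this approach: Base64 emits four characters for every three input bytes and pads the final group, so the encoded string is roughly a third longer than the byte array. The sketch below checks that arithmetic against `Convert.ToBase64String`; the class `Base64Size` and the method `EncodedLength` are hypothetical names for illustration, not part of .NET.

```csharp
using System;

public static class Base64Size
{
    // Predicted Base64 length: 4 characters per group of 3 bytes,
    // with the last partial group rounded up (padding included).
    public static int EncodedLength(int byteCount)
    {
        return 4 * ((byteCount + 2) / 3);
    }

    public static void Main()
    {
        foreach (int n in new[] { 0, 1, 2, 3, 100, 1000 })
        {
            // Encode n zero bytes and compare with the prediction.
            int actual = Convert.ToBase64String(new byte[n]).Length;
            Console.WriteLine(n + " bytes -> " + actual
                + " chars (predicted " + EncodedLength(n) + ")");
        }
    }
}
```

If the attribute has a maximum length, a check like `EncodedLength(byteArray.Length)` before writing can tell you whether the compressed payload will fit.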

In theory you could come up with an encoding that is more efficient than this by making use of more of the Unicode character set. However, in order to do so reliably, you would need to know quite a bit about Unicode.

Timwi
+1 Thanks for pointing out the mistake in my answer!
Venemo
+1 for the Nutella love letter...romantic AND delicious!
Drew Hall
Just to be fair to MSFT, there are other binary properties that I could use but the client wants me to use "extension attributes" which are Unicode. There are Byte[] in other spots too. I like Nutella love letters. +1
MakerOfThings7