Say you've loaded a text file into a string and you'd like to convert all Unicode escapes in that string into the actual Unicode characters.

Example:

"The following is the top half of an integral character in unicode '\u2320', and this is the lower half '\U2321'."

I found an answer that works for me, and it follows.

+8  A: 

This is the answer that I came up with. It's simple and works well with strings up to at least several thousand characters.

Example 1:

Regex  rx = new Regex( @"\\[uU]([0-9A-F]{4})" );
result = rx.Replace( result, match => ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString() );

Example 2:

Regex  rx = new Regex( @"\\[uU]([0-9A-F]{4})" );
result = rx.Replace( result, delegate (Match match) { return ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString(); } );

The first example shows the replacement being made using a lambda expression (C# 3.0), and the second uses an anonymous delegate, which should work with C# 2.0.

To break down what's going on here, first we create a regular expression:

new Regex( @"\\[uU]([0-9A-F]{4})" );

Then we call Replace() with the string 'result' and an anonymous method (a lambda expression in the first example and a delegate in the second; it could also be a regular named method) that converts each match the regular expression finds in the string.

The unicode escape is processed like this:

((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString()

Get the string representing the number part of the escape (skip the first two characters).

      match.Value.Substring(2)

Parse that string using Int32.Parse(), which takes the string and the number format it should expect. In this case the format is a hex number.

      NumberStyles.HexNumber

Then we cast the resulting number to a Unicode character

      (char)

and finally we call ToString() on the Unicode character, which gives us its string representation. That string is the value passed back to Replace()

      .ToString()
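
The steps above can also be written out one at a time. This is only a sketch of the same approach with the intermediate values named; the class name UnicodeEscapes and the helper DecodeMatch are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class UnicodeEscapes
{
    // Same pattern as in the answer: \u or \U followed by four uppercase hex digits.
    static readonly Regex Rx = new Regex(@"\\[uU]([0-9A-F]{4})");

    // Hypothetical helper: converts one matched escape into its character.
    static string DecodeMatch(Match match)
    {
        string hex = match.Value.Substring(2);                // skip the backslash and the u/U
        int code = Int32.Parse(hex, NumberStyles.HexNumber);  // parse the digits as a hex number
        char ch = (char) code;                                // cast the number to a character
        return ch.ToString();                                 // string handed back to Replace()
    }

    public static string Decode(string input)
    {
        // A method group converts to MatchEvaluator just like the lambda or delegate does.
        return Rx.Replace(input, DecodeMatch);
    }
}
```

Calling UnicodeEscapes.Decode(result) should behave the same as the one-liners in Example 1 and Example 2.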

Note: instead of grabbing the text to be converted with a Substring call, you could use the match parameter's GroupCollection and the subexpression in the regular expression to capture just the number ('2320'), but that's more complicated and less readable.

jr
\u and \U should be treated differently -- \u specifies 4 hex digits (16 bits), whereas \U specifies 8 (32 bits) -- a Unicode code point is 21 bits long. Also, you should use the char.ConvertFromUtf32() method rather than a cast.
Alex Lyman
I've seen \u and \U documented both ways, though the current C# language specification indicates 4 hex digits for \u and 8 hex digits for \U. In any case, \U with only 4 hex digits is processed correctly by the code above. I'll have to check whether ConvertFromUtf32() is functionally different from a cast.
jr
Yeah, I read the ignorecase option in the second part of the post after realising myself. Thanks all the same. :)
Echilon
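
For what it's worth, here is one way the suggestion in the comments above could look: \u takes exactly 4 hex digits and \U exactly 8, with char.ConvertFromUtf32() used for the 8-digit form. Treat it as a sketch under my own assumptions, not as the accepted answer's method. The class name UnicodeEscapesStrict is made up, and leaving invalid \U escapes untouched is a choice I made because ConvertFromUtf32() throws for values outside U+0000..U+10FFFF and for surrogate code points.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class UnicodeEscapesStrict
{
    // Group 1: \U with exactly 8 hex digits. Group 2: \u with exactly 4 hex digits.
    static readonly Regex Rx = new Regex(@"\\U([0-9A-Fa-f]{8})|\\u([0-9A-Fa-f]{4})");

    public static string Decode(string input)
    {
        return Rx.Replace(input, match =>
        {
            if (match.Groups[2].Success)
            {
                // \uXXXX is a single UTF-16 code unit, so a cast is enough
                // and keeps escaped surrogate pairs working.
                return ((char) Int32.Parse(match.Groups[2].Value, NumberStyles.HexNumber)).ToString();
            }

            // \UXXXXXXXX is a full code point. Eight hex digits can parse to a
            // negative int (e.g. FFFFFFFF), so check the range before converting.
            int codePoint = Int32.Parse(match.Groups[1].Value, NumberStyles.HexNumber);
            bool valid = codePoint >= 0 && codePoint <= 0x10FFFF
                         && (codePoint < 0xD800 || codePoint > 0xDFFF);

            // Assumption: invalid escapes are left as-is rather than throwing.
            return valid ? char.ConvertFromUtf32(codePoint) : match.Value;
        });
    }
}
```

Note that with this stricter pattern the '\U2321' from the question is left alone, since it has only 4 digits after \U. It would have to be written as '\U00002321' or '\u2321'.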
A: 

Refactored a little:

Regex regex = new Regex (@"\\[uU]([0-9A-F]{4})", RegexOptions.IgnoreCase);
string line = "...";
line = regex.Replace (line, match => ((char)int.Parse (match.Groups[1].Value,
  NumberStyles.HexNumber)).ToString ());
George Tsiokos