ansaurus

Question

Regular expression of unicode characters on string

Answer 1

A:

Strings in .NET are UTF-16 encoded.

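Additionally, regex engines don't match against user-perceived characters but against encoded values. The .NET Regex engine works on UTF-16 code units, so a `\uXXXX` escape takes exactly four hexadecimal digits. See this post.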

Doug 2010-05-14 15:08:23

Answer 2

+1 A:

This line:

match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);

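throws an ArgumentOutOfRangeException. The first Remove returns a string one character shorter, but the second Remove still computes its start index from the original match.Value.Length, which now points past the last character of the shortened string.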
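As a rough illustration of that failure, here is a minimal sketch. The sample string, the class name RemoveDemo and its helper methods are made up for the example and are not from the original question.

```csharp
using System;

static class RemoveDemo
{
    // Returns true if the chained Remove call from the question throws.
    public static bool ChainedRemoveThrows(string value)
    {
        try
        {
            // The first Remove shortens the string by one character, but
            // value.Length still refers to the original string, so the
            // second start index lies past the end of the shortened string.
            value.Remove(0, 1).Remove(value.Length - 1, 1);
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    // Strips the first and last character in one step, so no stale
    // length is involved. Assumes value has at least two characters.
    public static string Inner(string value)
    {
        return value.Substring(1, value.Length - 2);
    }

    static void Main()
    {
        // Hypothetical matched text: one delimiter on each side of the digits.
        string value = "#1234#";

        Console.WriteLine(ChainedRemoveThrows(value)); // True
        Console.WriteLine(Inner(value));               // 1234
    }
}
```

Substring is only one way to avoid the stale length. The capture group approach suggested next sidesteps the trimming altogether.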

I suggest using a capture group to extract the value instead. For example:

Regex regEx = new Regex(@"\u9288(\d+)\u9288");
Match match = regEx.Match(numbers);
if (match.Success)
    checkNum = match.Groups[1].Value;

With that, I can extract the values correctly.

bruno conde 2010-05-14 15:21:28

Answer 3

+1 A:

I figured it out. I was using the decimal values instead of the hex codes. In other words, instead of \u9288 and \u9286 I should have been using \u2448 and \u2446. See http://www.ssec.wisc.edu/~tomw/java/unicode.html#x2440
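For anyone hitting the same mix-up, the conversion can be sketched roughly as follows. The CodePointDemo class and the ToEscape helper are hypothetical names for this example, and the helper only makes sense for code points up to 0xFFFF.

```csharp
using System;
using System.Text.RegularExpressions;

static class CodePointDemo
{
    // Formats a decimal code point as a four-hex-digit \uXXXX escape.
    // Only meaningful for values up to 0xFFFF (a single UTF-16 code unit).
    public static string ToEscape(int codePoint)
    {
        return "\\u" + codePoint.ToString("X4");
    }

    // Extracts the digits between two delimiter characters given by their
    // decimal code point, or returns null if there is no match.
    public static string ExtractBetween(string input, int delimiterCodePoint)
    {
        string delimiter = ToEscape(delimiterCodePoint);
        Match match = Regex.Match(input, delimiter + @"(\d+)" + delimiter);
        return match.Success ? match.Groups[1].Value : null;
    }

    static void Main()
    {
        Console.WriteLine(ToEscape(9288)); // \u2448
        Console.WriteLine(ToEscape(9286)); // \u2446

        // Made-up sample input: digits wrapped in U+2448 on both sides.
        string numbers = "\u2448" + "1234" + "\u2448";
        Console.WriteLine(ExtractBetween(numbers, 9288)); // 1234
    }
}
```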

Thanks guys for leading me in the right direction.

Marcus King 2010-05-14 16:23:48

You should accept this answer to keep the thread from getting automatically revived every few months. And use the `regex` tag instead of variants like 'regularexpressions'--it's the one regex specialists look for (though it turned out this wasn't really a regex question after all).

Alan Moore 2010-05-15 09:57:58
