I'm working in C# doing some OCR work and have extracted the text I need to work with. Now I need to parse a line using Regular Expressions.

string checkNum;
string routingNum;
string accountNum;
Regex regEx = new Regex(@"\u9288\d+\u9288");
Match match = regEx.Match(numbers);
if (match.Success)
    checkNum = match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);
regEx = new Regex(@"\u9286\d{9}\u9286");
match = regEx.Match(numbers);
if(match.Success)
    routingNum = match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);
regEx = new Regex(@"\d{10}\u9288");
match = regEx.Match(numbers);
if (match.Success)
    accountNum = match.Value.Remove(match.Value.Length - 1, 1);

The problem is that the string contains the necessary Unicode characters when I do a .ToCharArray() and inspect the contents of the string, but it never seems to recognize the Unicode characters when I parse the string looking for them. I thought strings in C# were Unicode by default.

A: 

Strings in .NET are UTF-16 encoded.

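Additionally, regex engines don't match against Unicode characters as such, but against their numeric code values: in a .NET pattern, `\uXXXX` matches the UTF-16 code unit written as exactly four hexadecimal digits. See this post.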
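
As a rough illustration of the point above, the sketch below dumps each UTF-16 code unit of a string in the four-digit hexadecimal form that a `\uXXXX` escape expects. The sample string and the `DumpCodeUnits` helper are hypothetical, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Linq;

// Hypothetical helper: format every UTF-16 code unit of a string as U+XXXX (hex).
static string DumpCodeUnits(string s) =>
    string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));

// Assumed sample input: an OCR dash (U+2448) followed by two ASCII digits.
string sample = "\u244812";

Console.WriteLine(DumpCodeUnits(sample)); // U+2448 U+0031 U+0032
```

Comparing this output with the escapes in a pattern is a quick way to check whether the two actually refer to the same character.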

Doug
+1  A: 

This line:

match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);

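throws an ArgumentOutOfRangeException because the string returned by the first Remove is one character shorter than match.Value, so the start index match.Value.Length - 1 passed to the second Remove, plus the count of 1, runs past its end.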
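
A minimal sketch of why the chained call fails, using a hypothetical matched value with ASCII `#` delimiters so the index arithmetic is easy to follow (assumes C# 9 or later for top-level statements):

```csharp
using System;

// Hypothetical matched text: one delimiter on each side of the digits.
string value = "#1234#";                  // Length is 6

string afterFirst = value.Remove(0, 1);   // "1234#", Length is 5

bool threw = false;
try
{
    // Start index 5 (value.Length - 1) plus count 1 runs past the 5-character string.
    afterFirst.Remove(value.Length - 1, 1);
}
catch (ArgumentOutOfRangeException)
{
    threw = true;
}

// Removing the trailing delimiter first keeps both indices valid.
string digits = value.Remove(value.Length - 1, 1).Remove(0, 1);

Console.WriteLine(threw);   // True
Console.WriteLine(digits);  // 1234
```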

I suggest you use a capturing group to extract the value instead. For example:

Regex regEx = new Regex(@"\u9288(\d+)\u9288");
Match match = regEx.Match(numbers);
if (match.Success)
    checkNum = match.Groups[1].Value;

With that, I can extract the values correctly.

bruno conde
+1  A: 

I figured it out. I was using the decimal values instead of the hex codes. In other words, instead of \u9288 and \u9286 I should have been using \u2448 and \u2446 (9288 and 9286 are the decimal forms of 0x2448 and 0x2446). See http://www.ssec.wisc.edu/~tomw/java/unicode.html#x2440
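
The mix-up can be reproduced in a few lines. This is only a sketch: the sample input is made up, and it assumes the OCR delimiter in the string is U+2448 (decimal 9288), as described above. It also assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed sample input: digits wrapped in U+2448, whose decimal value is 9288.
string numbers = "\u24481234\u2448";

// \u always reads four HEX digits, so \u9288 means U+9288, a different character.
bool decimalDigitsMatch = Regex.IsMatch(numbers, @"\u9288\d+\u9288");

// Convert the decimal value seen in the debugger to the hex form the escape needs.
string hex = 9288.ToString("X4");

Match match = Regex.Match(numbers, @"\u2448(\d+)\u2448");

Console.WriteLine(decimalDigitsMatch);     // False
Console.WriteLine(hex);                    // 2448
Console.WriteLine(match.Groups[1].Value);  // 1234
```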

Thanks guys for leading me in the right direction.

Marcus King
You should accept this answer to keep the thread from getting automatically revived every few months. And use the `regex` tag instead of variants like 'regularexpressions'--it's the one regex specialists look for (though it turned out this wasn't really a regex question after all).
Alan Moore