Regex.IsMatch( "foo", "[\U00010000-\U0010FFFF]" )

Throws: System.ArgumentException: parsing "[-]" - [x-y] range in reverse order.

Looking at the hex values for \U00010000 and \U0010FFFF, I get 0xd800 0xdc00 for the first character and 0xdbff 0xdfff for the second.

So I guess I really have just one problem: why are the Unicode characters formed with \U split into two chars in the string?

+5  A: 

They're surrogate pairs. Look at the values - they're over 65535. A char is only a 16-bit value. How would you express 65536 in only 16 bits?
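As a rough illustration of where the two chars come from, here is a minimal sketch of the UTF-16 surrogate arithmetic. The class and method names (`SurrogateDemo`, `ToSurrogates`) are made up for this example; only `char.ConvertFromUtf32` is a real framework call, and it should produce the same pair.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: splits a code point above U+FFFF into the two
    // 16-bit values (high and low surrogate) that UTF-16 uses to encode it.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;               // a 20-bit value
        char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
        char low  = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }
}
```

Calling `SurrogateDemo.ToSurrogates(0x10000)` gives 0xD800 0xDC00 and `SurrogateDemo.ToSurrogates(0x10FFFF)` gives 0xDBFF 0xDFFF, which are the values reported in the question. That is also why the character class reads as a reversed range: the engine sees four separate chars, and the middle two form the range 0xDC00-0xDBFF.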

Unfortunately it's not clear from the documentation how (or whether) the regular expression engine in .NET copes with characters which aren't in the basic multilingual plane. (The \uxxxx pattern in the regular expression documentation only covers 0-65535, just like \uxxxx as a C# escape sequence.)

Is your real regular expression bigger, or are you actually just trying to see if there are any non-BMP characters in there?

Jon Skeet
Actually, you're right. From what I've found, \u only supports 4 hex digits (exactly 4, no more, no less), so \uFFFF is the maximum. I've deleted my "solution" because while it does not produce an error, it does not seem to be a valid Unicode regex. I still believe that the \ needs to be escaped.
Michael Stum
Without the @ you would need to escape \ if \UFFFF were regex syntax (like \d for [0-9]), but instead it is string literal syntax (like \n for the new-line character).
Christoph Rüegg
A: 

@Jon Skeet

So what you are telling me is that there is no way to use the Regex tools in .NET to match chars outside the 16-bit range (i.e. outside the basic multilingual plane)?

The full regex is:

^(\u0009|[\u0020-\u007E]|\u0085|[\u00A0-\uD7FF]|[\uE000-\uFFFD]|[\U00010000-\U0010FFFF])+$

I am attempting to check whether a string contains only what a YAML document defines as printable Unicode characters.

kazakdogofspace
I don't know, unfortunately. I can't see anything in the documentation about how to use the .NET regular expression engine with characters outside the basic multilingual plane. However, it's probably not too hard to implement what you want without using regular expressions at all.
Jon Skeet
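For what the regex-free approach suggested above might look like, here is a minimal sketch. It assumes the ranges in the question's pattern are exactly the ones wanted; `YamlPrintable`, `IsPrintable` and `IsAllPrintable` are names invented for this example, while `char.IsSurrogatePair` and `char.ConvertToUtf32` are the framework's own methods.

```csharp
using System;

public static class YamlPrintable
{
    // Ranges copied from the pattern in the question (an assumption:
    // they are taken as given, not re-checked against the YAML spec).
    public static bool IsPrintable(int cp) =>
        cp == 0x09 ||
        (cp >= 0x20 && cp <= 0x7E) ||
        cp == 0x85 ||
        (cp >= 0xA0 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);

    public static bool IsAllPrintable(string text)
    {
        // Mirrors the "+" quantifier: an empty string does not match.
        // Unlike Regex.IsMatch, null is rejected rather than throwing.
        if (string.IsNullOrEmpty(text)) return false;

        for (int i = 0; i < text.Length; i++)
        {
            int cp;
            if (char.IsSurrogatePair(text, i))
            {
                cp = char.ConvertToUtf32(text, i); // decode the pair
                i++;                               // skip the low surrogate
            }
            else
            {
                cp = text[i]; // BMP char, or a lone surrogate (rejected below)
            }

            if (!IsPrintable(cp)) return false;
        }
        return true;
    }
}
```

A lone surrogate falls in 0xD800-0xDFFF, which none of the ranges cover, so malformed strings are rejected without any special casing.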
Alternatively, you could use the 16-bit code units which make up the surrogate pairs: [\ud800-\udfff]. It's at least worth trying that...
Jon Skeet
(And at that point, you can combine several of your ranges together - the later bits are just [\u00a0-\ufffd].)
Jon Skeet
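One way the surrogate idea could be applied to the full pattern is sketched below. It is a stricter variant than a single [\ud800-\udfff] range: it requires a high surrogate immediately followed by a low surrogate, so lone surrogates are rejected. The class name `YamlPrintableRegex` is invented for this example, and `\z` is used instead of the question's `$` because `$` in .NET also matches just before a trailing newline; treat both choices as assumptions rather than the thread's agreed answer.

```csharp
using System.Text.RegularExpressions;

public static class YamlPrintableRegex
{
    // The BMP ranges are copied from the question and merged into one class.
    // The second alternative matches one high surrogate followed by one low
    // surrogate, which together cover U+10000 through U+10FFFF.
    private static readonly Regex Printable = new Regex(
        @"^(?:[\u0009\u0020-\u007E\u0085\u00A0-\uD7FF\uE000-\uFFFD]" +
        @"|[\uD800-\uDBFF][\uDC00-\uDFFF])+\z");

    public static bool IsAllPrintable(string text) => Printable.IsMatch(text);
}
```

Because the pattern is a verbatim string, the \uXXXX escapes are interpreted by the regex engine rather than by the C# compiler, and each one names a single 16-bit char, so no range ends up in reverse order.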
The .NET regular expression engine operates on UTF-16 code units, not Unicode code points; see http://code.logos.com/blog/2008/07/net_regular_expressions_and_unicode.html. I've filed a Connect bug on this issue: https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=357780
Bradley Grainger