ansaurus

Question

C# Regular Expressions with \Uxxxxxxxx characters in the pattern.

Answer 1

+5 A:

They're surrogate pairs. Look at the values - they're over 65535. A char is only a 16-bit value. How would you express 65536 in only 16 bits?

Unfortunately it's not clear from the documentation how (or whether) the regular expression engine in .NET copes with characters which aren't in the basic multilingual plane. (The \uxxxx pattern in the regular expression documentation only covers 0-65535, just like \uxxxx as a C# escape sequence.)

Is your real regular expression bigger, or are you actually just trying to see if there are any non-BMP characters in there?

Jon Skeet 2008-12-12 20:24:24
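
To make the surrogate pair point concrete, here is a minimal sketch showing that a single code point above 65535 occupies two char values in a C# string. The class name SurrogateDemo and the helper CodeUnits are made up for this illustration; char.IsSurrogatePair and char.ConvertToUtf32 are the standard framework methods.

```csharp
using System;

public static class SurrogateDemo
{
    // Returns the UTF-16 code units (C# char values) of a string as integers.
    public static int[] CodeUnits(string s)
    {
        var units = new int[s.Length];
        for (int i = 0; i < s.Length; i++) units[i] = s[i];
        return units;
    }

    public static void Main()
    {
        string s = "\U00010000"; // first code point beyond the BMP
        int[] units = CodeUnits(s);

        Console.WriteLine(units.Length);                        // 2
        Console.WriteLine("{0:X4} {1:X4}", units[0], units[1]); // D800 DC00
        Console.WriteLine(char.IsSurrogatePair(s, 0));          // True
        Console.WriteLine("{0:X}", char.ConvertToUtf32(s, 0));  // 10000
    }
}
```

So a range such as [\U00010000-\U0010FFFF] ends up in the pattern as a sequence of surrogate chars rather than as two single endpoints.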

Actually, you're right. From what I've found, \u only supports 4 hex digits (exactly 4, no more and no less), so \uFFFF is the maximum. I've deleted my "solution" because, while it does not produce an error, it does not seem to be a valid Unicode regex. I still believe that the \ needs to be escaped.

Michael Stum 2008-12-12 20:35:05

Without the @ you would need to escape \ if \UFFFF were regex syntax (like \d for [0-9]), but instead it is string literal syntax (like \n for the new-line character).

Christoph Rüegg 2008-12-12 20:40:52
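
The difference between string literal syntax and regex syntax can be seen with a BMP character, where both forms happen to work. This is only a sketch; the class name EscapeDemo and its constants are invented for the example. In the regular literal the C# compiler replaces the escape before the regex engine sees it, while in the verbatim literal the backslash sequence reaches the engine unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Regular literal: the compiler turns \u0041 into the single character "A".
    public const string CompilerEscaped = "\u0041";

    // Verbatim literal: six characters that the regex engine itself interprets.
    public const string RegexEscaped = @"\u0041";

    public static void Main()
    {
        Console.WriteLine(CompilerEscaped.Length);                // 1
        Console.WriteLine(RegexEscaped.Length);                   // 6
        Console.WriteLine(Regex.IsMatch("A", CompilerEscaped));   // True
        Console.WriteLine(Regex.IsMatch("A", RegexEscaped));      // True
    }
}
```

Since the regex documentation mentioned above only lists \uxxxx, a \Uxxxxxxxx escape is best treated as string literal syntax that the compiler expands before the pattern is parsed.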

Answer 2

A:

@Jon Skeet

So what you are telling me is that there is no way to use the Regex tools in .NET to match chars outside the 16-bit (BMP) range?

The full regex is:

^(\u0009|[\u0020-\u007E]|\u0085|[\u00A0-\uD7FF]|[\uE000-\uFFFD]|[\U00010000-\U0010FFFF])+$

I am attempting to check whether a string contains only what a YAML document defines as printable Unicode characters.

kazakdogofspace 2008-12-12 20:41:05

I don't know, unfortunately. I can't see anything in the documentation about how to use the .NET regular expression engine with characters outside the basic multilingual plane. However, it's probably not too hard to implement what you want without using regular expressions at all.

Jon Skeet 2008-12-12 20:52:53
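
A regex-free check along these lines might look like the sketch below. It is not taken from the thread: the class and method names (YamlPrintable, IsPrintableCodePoint, IsPrintable) are hypothetical, and the ranges are simply copied from the pattern in the question. It walks the string one code point at a time and rejects lone surrogates.

```csharp
using System;

public static class YamlPrintable
{
    // True if the code point is in one of the ranges from the question's pattern.
    public static bool IsPrintableCodePoint(int cp) =>
        cp == 0x09
        || (cp >= 0x20 && cp <= 0x7E)
        || cp == 0x85
        || (cp >= 0xA0 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);

    public static bool IsPrintable(string s)
    {
        // The pattern uses +, so an empty string is treated as a non-match.
        if (string.IsNullOrEmpty(s)) return false;

        for (int i = 0; i < s.Length; i++)
        {
            int cp;
            if (char.IsSurrogatePair(s, i))
            {
                cp = char.ConvertToUtf32(s, i);
                i++; // skip the low surrogate that was just consumed
            }
            else if (char.IsSurrogate(s[i]))
            {
                return false; // lone surrogate, not a valid code point
            }
            else
            {
                cp = s[i];
            }

            if (!IsPrintableCodePoint(cp)) return false;
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(IsPrintable("hello"));         // True
        Console.WriteLine(IsPrintable("a\U0001F600b"));  // True
        Console.WriteLine(IsPrintable("a\u0001"));       // False
        Console.WriteLine(IsPrintable("\uD800"));        // False
    }
}
```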

Alternatively, you could use the 16-bit code units which make up the surrogate pairs: [\ud800-\udfff]. It's at least worth trying that...

Jon Skeet 2008-12-12 20:54:17

(And at that point, you can combine several of your ranges together - the later bits are just [\u00a0-\ufffd].)

Jon Skeet 2008-12-12 20:54:53
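
As a sketch of the surrogate-based approach, here are two variants of the pattern written as verbatim strings so the regex engine handles the \uxxxx escapes itself. The class name SurrogateRegex and the field names are made up for this example. The Merged pattern follows the combined range suggested above and therefore also accepts a lone surrogate; the Strict pattern is an assumed refinement that requires a high surrogate to be followed by a low one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SurrogateRegex
{
    // Pair-aware: a high surrogate must be followed by a low surrogate.
    public static readonly Regex Strict = new Regex(
        @"^(?:[\u0009\u0020-\u007E\u0085\u00A0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])+$");

    // Merged ranges: simpler, but a lone surrogate also passes.
    public static readonly Regex Merged = new Regex(
        @"^[\u0009\u0020-\u007E\u0085\u00A0-\uFFFD]+$");

    public static void Main()
    {
        string nonBmp = "a\U0001F600b"; // contains one surrogate pair
        string lone = "\uD800";         // unpaired high surrogate

        Console.WriteLine(Strict.IsMatch(nonBmp)); // True
        Console.WriteLine(Merged.IsMatch(nonBmp)); // True
        Console.WriteLine(Strict.IsMatch(lone));   // False
        Console.WriteLine(Merged.IsMatch(lone));   // True
    }
}
```

Which variant is preferable depends on whether malformed UTF-16 input should be rejected or tolerated.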

The .NET regular expression engine operates on UTF-16 code units, not Unicode characters; see http://code.logos.com/blog/2008/07/net_regular_expressions_and_unicode.html. I've filed a Connect bug on this issue: https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=357780

Bradley Grainger 2008-12-14 00:55:00
