Hello,
This Perl byte-level regex, found at http://www.w3.org/International/questions/qa-forms-utf-8.en.php, checks whether a byte string is valid UTF-8, so it can identify UTF-8 documents even when they have no UTF-8 BOM:
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
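To make the intended behavior concrete, here is a minimal sketch of the same pattern applied to raw bytes in Python. This is not the C# or PowerShell version I am after. The names `UTF8_PATTERN` and `looks_like_utf8` are made up for illustration, and `re.VERBOSE` and `\Z` stand in for Perl's `/x` and `\z`.

```python
import re

# Byte-level port of the Perl pattern above (illustrative names, not from the W3C page).
# re.VERBOSE plays the role of Perl's /x; \Z anchors at the very end like Perl's \z.
UTF8_PATTERN = re.compile(rb"""
    \A(?:
        [\x09\x0A\x0D\x20-\x7E]             # ASCII
      | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]          # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]          # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
    )*\Z
""", re.VERBOSE)


def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte of data fits the pattern above."""
    return UTF8_PATTERN.match(data) is not None


if __name__ == "__main__":
    print(looks_like_utf8("héllo €".encode("utf-8")))  # valid UTF-8 text
    print(looks_like_utf8(b"caf\xe9"))                 # Latin-1 bytes, not UTF-8
```

Note that the ASCII branch only admits tab, LF, CR and printable characters, so input containing NUL or other control bytes is rejected. That is convenient for telling text from binary files.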
I need this because I am working on a PowerShell equivalent to 'grep -I', and part of this involves detecting text encoding.
But how do I rewrite this in C# or PowerShell? In other words, how do I express it in .NET Regex syntax?
EDIT: I found a question about this very same regex at http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/6a81be63-e6da-4156-a5bf-8b9782a1ac40. The short answer seems to be that this cannot be done in .NET, since the .NET Regex class operates on strings rather than raw bytes and so does not support binary regular expressions.