views:

760

answers:

4

Hello,

This Perl binary regex found at http://www.w3.org/International/questions/qa-forms-utf-8.en.php matches UTF-8 documents without the UTF-8 BOM header:

$field =~
m/\A(
 [\x09\x0A\x0D\x20-\x7E]            # ASCII
 | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
 |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
 | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
 |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
 |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
 | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
 |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*\z/x;

I need this because I am working on a PowerShell equivalent to 'grep -I', and part of this involves detecting text encoding.

But how do I rewrite this in C# or PowerShell? Or in other words, in ".Net Regex" syntax?

EDIT: Found this http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/6a81be63-e6da-4156-a5bf-8b9782a1ac40 question about the same Regex of all things. The short answer seems like this can not be done with .Net since .Net does not support binary regular expressions.

A: 

What specifically are you trying to do?

You should be able to use the System.Text.Encoding class.

SLaks
I don't see how to *detect* the encoding of a binary stream using this class. The regular expression in the question matches true if the binary stream is UTF-8 encoded.
kervin
kervin: You can try parsing the stream as UTF-8. If it fails, then it wasn't UTF-8, otherwise it was.
Joey
+1  A: 

Try this: (I haven't checked that it matches correctly; you can easily try it in LINQPad).

new Regex(@"
 ^(
 [\x09\x0A\x0D\x20-\x7E]            # ASCII
 | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
 |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
 | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
 |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
 |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
 | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
 |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
 )*$", RegexOptions.IgnorePatternWhitespace)

EDIT:

Try reading your file using an ASCII StreamReader; that should do what you're looking for. (Note that I didn't actually try it)

SLaks
The Perl regex is a binary regex. So this will not work. After more research it doesn's seem that .Net supports binary regular expressions.
kervin
You can fake "binary" regex matching by decoding the byte stream in such a way that each byte is converted to a character with the same numeric value. Just use ISO-8859-1.
Alan Moore
+1  A: 

This post at http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/6a81be63-e6da-4156-a5bf-8b9782a1ac40 describes several workarounds.

kervin
See my edit.
SLaks
+1  A: 

The odds are pretty good that if a sequence has no invalid UTF-8 characters, it can be treated as UTF-8. Since RegExps are for text in .Net, not byte arrays, here's a non-regexp solution that should work. Personally, I'd rather use this as a fallback mechanism (e.g. mycommand -autodetect) and offer pipeline parameters that allow user-specified encodings.

       string result=String.Empty;
        Encoding ae = Encoding.GetEncoding(
              Encoding.UTF8.EncodingName,
              new EncoderExceptionFallback(), 
              new DecoderExceptionFallback());
        try {
            result=ae.GetString(mybytes);
        }
        catch (DecoderFallbackException e)
        {
            //revert to some sensible default. Maybe the Ansi Code page for this environment?
            // This will use the substitution fallback mechanism, which usually replaces unknown characters with question marks.
            result=Encoding.Default.GetString(mybytes);
        }

If you can interact with unmanaged code, research the MLANG dll that ships with IE. It has alternate encoding autodetection methods that may be more useful.

JasonTrue