tags:

views:

246

answers:

2

I've found a Perl regexp that can check if a string is UTF-8 (the regexp is from w3c site).

$field =~
  m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;

But I'm not sure how to port it to MySQL as it seems that MySQL don't support hex representation of characters see this question.

Any thoughts how to port the regexp to MySQL? Or maybe you know any other way to check if the string is valid UTF-8?

UPDATE: I need this check working on the MySQL as I need to run it on the server to correct broken tables. I can't pass the data through a script as the database is around 1TB.

A: 

If you are in control of both the input and output side of this DB then you should be able to verify that your data is UTF-8 on whichever side you like and implement constraints as necessary. If you are dealing with a system where you don't control the input side then you are going to have to check it after you pull it out and possibly convert in your language of choice (Perl it sounds like).

The database is a REALLY good storage facility but should not be used aggressively for other applications. I think this is one spot where you should just let the MySQL hold the data until you need to do something further with it.

If you want to continue on the path you are on then check out this MySQL Manual Page: http://dev.mysql.com/doc/refman/5.0/en/regexp.html

REGEX is normally VERY similar between languages (in fact I can almost always copy between JavaScript, PHP, and Perl with only minor adjustments for their wrapping functions) so if that is working REGEX then you should be able to port it easily.

GL!

EDIT: Look at this Stack article--you might want to use Stored Procedures considering you cannot using scripting to handle the data: http://stackoverflow.com/questions/1650018/regular-expressions-in-stored-procedures

With Stored Procedures you can loop through the data and do a lot of handling without ever leaving MySQL. That second article is going to refer you right back to the one I listed though so I think you need to first prove out your REGEX and get it working, then look into Stored Procedures.

angryCodeMonkey
I can't port the hex representation of characters like \x61 see the question i'm referring to for more details
Piotr Czapla
@Piotr Czapla - can you give me an idea of what data you are trying to store and why you need to store it in hex? That might help me or someone else figure it out. What type of field are you using to store this? (text|binary|varchar). Also I am extending my answer--see if this helps.
angryCodeMonkey
@Shame I don't store any data in hex.I repairing a broken database I need to check if that the repairing transformation is working correctly. To do so I'd like to port the regexp quoted in the question to mysql. The regexp use hex representation of characters to define ranges of values that are valid utf characters. And mysql don't support hex representation. I need to find a way to work around this.
Piotr Czapla
+1  A: 

I've managed to repair my database using a test that works only if your data can be represented using a one-byte encoding in my case it was a latin1.

I've used the fact that mysql changes the bytes that aren't utf-8 to '?' when converting to latin1.

Here is how the check looks like:

SELECT (
         CONVERT(
           CONVERT(
              potentially_broken_column 
           USING latin1) 
         USING utf8))
       != 
       potentially_broken_column) AS INVALID ....
Piotr Czapla