I am rebuilding a database whose table structure was poor, so I'm porting some of the data from one table to another. The data appears to have been copy-pasted from a Microsoft Office product, so as I read it I clean it up with HTMLPurifier and some str_replace calls in PHP. Here is the clean function:

   function clean_html($html) {
    $config = HTMLPurifier_Config::createDefault();
    $config->set('AutoFormat','RemoveEmpty',true);
    $config->set('HTML','AllowedAttributes','href,src');
    $config->set('HTML','AllowedElements','p,em,strong,a,ul,li,ol,img');
    $purifier = new HTMLPurifier($config);

    $html = $purifier->purify($html);

    $html = str_replace(' ',' ',$html);
    $html = str_replace("\r",'',$html);
    $html = str_replace("\n",'',$html);
    $html = str_replace("\t",'',$html);
    $html = str_replace('  ',' ',$html);
    $html = str_replace('<p> </p>','',$html);
    $html = str_replace(chr(160),' ',$html);

    return trim($html);
   }

However, when I put the results into my new table and output them to CKEditor, I still get those three characters.

I also have a JavaScript function that is called to remove special characters from the content of CKEditor. It doesn't clean them up either:

  function remove_special(str) {
    var rExps=[ /[\xC0-\xC2]/g, /[\xE0-\xE2]/g,
    /[\xC8-\xCA]/g, /[\xE8-\xEB]/g,
    /[\xCC-\xCE]/g, /[\xEC-\xEE]/g,
    /[\xD2-\xD4]/g, /[\xF2-\xF4]/g,
    /[\xD9-\xDB]/g, /[\xF9-\xFB]/g,
    /\xD1/,/\xF1/g,
    "/[\u00a0|\u1680|[\u2000-\u2009]|u200a|\u200b|\u2028|\u2029|\u202f|\u205f|\u3000|\xa0]/g", 
    /\u000b/g,'/[\u180e|\u000c]/g',
    /\u2013/g, /\u2014/g,
    /\xa9/g,/\xae/g,/\xb7/g,/\u2018/g,/\u2019/g,/\u201c/g,/\u201d/g,/\u2026/g];
    var repChar=['A','a','E','e','I','i','O','o','U','u','N','n',' ','\t','','-','--','(c)','(r)','*',"'","'",'"','"','...'];

    for(var i=0; i<rExps.length; i++) {
        str=str.replace(rExps[i],repChar[i]);
    }

      for (var x = 0; x < str.length; x++) {
    charcode = str.charCodeAt(x);
          if ((charcode < 32 || charcode > 126) && charcode !=10 && charcode != 13) {
              str = str.replace(str.charAt(x), "");
          }
      }
      return str;
  }

Does anyone know offhand what I need to do to get rid of them? I think they may be some sort of quote.

+1  A: 

The first answer in this SO thread should point you in the right direction and simplify your remove_special() function as well.
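
The linked thread is not reproduced here, so the following is only one possible simplification, not necessarily what that answer proposes. Two details of the posted function are worth noting: the entries written in quotes are plain strings, so `replace` looks for that literal text rather than treating it as a regular expression, and the final loop shortens the string while still advancing the index, so the character after each removal is skipped. The sketch below pairs each pattern with its replacement and strips the remaining non-ASCII characters in one pass. The names `REPLACEMENTS` and `removeSpecial` are hypothetical.

```javascript
// Hypothetical table: each pattern sits next to its replacement, so the two
// cannot drift out of step. The pairs mirror the question's arrays.
const REPLACEMENTS = [
  [/[\u00C0-\u00C2]/g, 'A'], [/[\u00E0-\u00E2]/g, 'a'],
  [/[\u00C8-\u00CA]/g, 'E'], [/[\u00E8-\u00EB]/g, 'e'],
  [/[\u00CC-\u00CE]/g, 'I'], [/[\u00EC-\u00EE]/g, 'i'],
  [/[\u00D2-\u00D4]/g, 'O'], [/[\u00F2-\u00F4]/g, 'o'],
  [/[\u00D9-\u00DB]/g, 'U'], [/[\u00F9-\u00FB]/g, 'u'],
  [/\u00D1/g, 'N'], [/\u00F1/g, 'n'],
  // Non-breaking and other Unicode spaces become a plain space.
  [/[\u00A0\u1680\u2000-\u200B\u2028\u2029\u202F\u205F\u3000]/g, ' '],
  [/[\u2018\u2019]/g, "'"], [/[\u201C\u201D]/g, '"'],
  [/\u2013/g, '-'], [/\u2014/g, '--'], [/\u2026/g, '...'],
  [/\u00A9/g, '(c)'], [/\u00AE/g, '(r)'], [/\u00B7/g, '*'],
];

function removeSpecial(str) {
  for (const [pattern, replacement] of REPLACEMENTS) {
    str = str.replace(pattern, replacement);
  }
  // Drop everything outside printable ASCII except LF and CR, in one pass,
  // so no character is skipped after a removal.
  return str.replace(/[^\x20-\x7E\n\r]/g, '');
}

console.log(removeSpecial('\u201Chi\u201D \u2014 caf\u00E9\uFFFD\uFFFD\uFFFD'));
// prints: "hi" -- cafe
```

Stripping U+FFFD this way only hides the symptom; the original characters are already lost by the time the replacement character appears.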

John Conde
+1  A: 

Had a similar issue: http://stackoverflow.com/questions/2298204/php-remove-identify-this-symbol

The character � is the REPLACEMENT CHARACTER (U+FFFD). It is used when there is an error within a UTF-encoded sequence:

FFFD � REPLACEMENT CHARACTER

 - used to replace an incoming character whose value 
   is unknown or unrepresentable in Unicode

In most cases it means that some data is being interpreted with a UTF encoding although it was actually encoded with a different one.

My problem was pasting text from Microsoft Office products into HTML or into a database. The largest offenders seem to be the em dash and smart quotes.
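
As a rough illustration of that mismatch, the sketch below takes the bytes that Windows-1252 uses for smart quotes and an em dash and reads them as UTF-8. It assumes a runtime with the standard `TextDecoder` (modern browsers, Node.js). The table `CP1252_PUNCTUATION` and the helper `decodeCp1252Punctuation` are hypothetical and cover only a handful of punctuation bytes, not the full code page.

```javascript
// A left smart quote, "Hi", a right smart quote and an em dash,
// as Windows-1252 stores them.
const cp1252Bytes = Uint8Array.from([0x93, 0x48, 0x69, 0x94, 0x97]);

// Read as UTF-8, the bytes 0x93, 0x94 and 0x97 are invalid on their own,
// so the decoder substitutes U+FFFD for each of them.
const misread = new TextDecoder('utf-8').decode(cp1252Bytes);
console.log(misread === '\uFFFDHi\uFFFD\uFFFD'); // prints: true

// Partial (hypothetical) Windows-1252 table for common Office punctuation.
const CP1252_PUNCTUATION = {
  0x85: '\u2026', // horizontal ellipsis
  0x91: '\u2018', // left single quote
  0x92: '\u2019', // right single quote
  0x93: '\u201C', // left double quote
  0x94: '\u201D', // right double quote
  0x96: '\u2013', // en dash
  0x97: '\u2014', // em dash
};

// Map listed bytes through the table; pass every other byte through unchanged.
// This is only correct for ASCII and the bytes listed above.
function decodeCp1252Punctuation(bytes) {
  return Array.from(bytes, (b) =>
    b in CP1252_PUNCTUATION ? CP1252_PUNCTUATION[b] : String.fromCharCode(b)
  ).join('');
}

console.log(decodeCp1252Punctuation(cp1252Bytes) === '\u201CHi\u201D\u2014'); // prints: true
```

The same bytes give either replacement characters or the intended punctuation depending only on which encoding the reader assumes.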

Phill Pafford
What did you do to fix it?
mcgrailm
The obvious fix would be using UTF-8 over the entire pipeline.
BalusC
Wouldn't that mean that the input into the original table would have needed to be encoded with UTF-8 to begin with?
mcgrailm
http://php.net/manual/en/function.htmlentities.php
Phill Pafford
+1  A: 

Your character encodings are all out of whack. � is indicative to me of a three-byte UTF-8 encoded character.
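
For reference, the punctuation that Office typically inserts does occupy three bytes each in UTF-8, as does U+FFFD itself. The small sketch below prints the byte sequences using the standard `TextEncoder`; the helper name `utf8Hex` is hypothetical.

```javascript
const utf8Encoder = new TextEncoder();

// Return the UTF-8 bytes of a string as space-separated uppercase hex.
function utf8Hex(text) {
  return Array.from(utf8Encoder.encode(text), (b) =>
    b.toString(16).toUpperCase().padStart(2, '0')
  ).join(' ');
}

console.log(utf8Hex('\u201C')); // left double quote, prints: E2 80 9C
console.log(utf8Hex('\u2014')); // em dash, prints: E2 80 94
console.log(utf8Hex('\uFFFD')); // replacement character, prints: EF BF BD
```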

Some things you need to discover

  • What was the encoding of the old table?
  • What is the encoding of the new table?
  • What is the encoding of the page that displays ckeditor?

It looks like HTMLPurifier's default is UTF-8, so you really need to be aware of the encoding of your data!
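
One quick way to test a sample of raw bytes is a strict UTF-8 decode, which fails on anything that is not well-formed UTF-8. The sketch below is a heuristic only, assuming a runtime whose `TextDecoder` supports the `fatal` option; the function name `isValidUtf8` is hypothetical.

```javascript
// Returns true when the bytes form well-formed UTF-8, false otherwise.
// With fatal: true the decoder throws instead of emitting U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// A left smart quote encoded as UTF-8 passes the check.
console.log(isValidUtf8(Uint8Array.from([0xE2, 0x80, 0x9C]))); // prints: true

// The same quote as a single Windows-1252 byte does not.
console.log(isValidUtf8(Uint8Array.from([0x93]))); // prints: false
```

Pure ASCII also passes this check, so a `true` result on a sample without special characters says nothing about how the rest of the table is encoded.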

Peter Bailey