views:

33

answers:

2

A website I recently completed with a friend has a gallery where one can upload images and text files. The only accepted text file (to ease development) is .txt and normally goes off without a hitch (or not..)

The problems I've encountered are the same of any developer: Microsoft's Extended ASCII.

Before outputting the text from the file, I go over several different layers to try to clean it up:

$txtfile = file_get_contents(".".$this->var['submission']['file_loc']);

// BOM Fun
    $boms = array
    (
        "utf8"    => array(3,pack("CCC",0xEF,0xBB,0xBF)),
        "utf16be"       => array(2,pack("CC",0xFE,0xFF)),
        "utf16le"       => array(2,pack("CC",0xFF,0xFE)),
        "utf32be"       => array(4,pack("CCCC",0x00,0x00,0xFE,0xFF)),
        "utf32le"       => array(4,pack("CCCC",0xFF,0xFE,0x00,0x00)),
        "gb18030"       => array(4,pack("CCCC",0x84,0x31,0x95,0x33))
    );
    foreach($boms as $bom)
    {
        if(mb_substr($txtfile,0,$bom[0]) == $bom[1])
        {
            $txtfile = substr($txtfile,$bom[0]);
            break;
        }
    }
$txtfile_o = $txtfile;
$badwords = array(chr(145),chr(146),chr(147),chr(148),chr(151),chr(133));
$fixwords = array("'","'",'"','"','-','...');
$txtfile_o = str_replace($badwords,$fixwords,$txtfile_o);
$txtfile_o = mb_convert_encoding($txtfile_o,"UTF-8");

The str_replace is the general method of converting Microsoft's awful smart quotes, em-dash, and ellipsis into their normal ASCII equivalents for output.

This code works perfectly find under the condition that the file uploaded is ANSI / us-ascii.

This code does not work (for no particular reason) when the uploaded file is UTF-8.

When the file is UTF-8, viewing the file itself in the web browser works fine, but printing it out via the web interface using this code does not. In this event, the smart quotes become some sort of accented a character.

This is where I'm stuck. The output encoding for the webpage is UTF-8, the web browser sees it as UTF-8, the file is in UTF-8 and yet neither the replace for the smart quotes works nor does the web browser display them correctly.

Any and all help on this would be greatly appreciated.

A: 

The characters you are trying to replace have different byte values in UTF8. Actually, they have more than one byte each in UTF8. You are trying to search for them with Windows encoding values and that's why you won't find them.

Look up the UTF8 byte sequences of the characters and use them for the search.

Stephen Chu
+1  A: 

If I understand correctly your problem is that your code that replaces "extended ASCII" characters for their ASCII counterparts fails when the user submits a file in UTF-8.

This was to be expected. You cannot operate on UTF-8 files with str_replace and the like, which operate at the byte level, while a character in UTF-8 is constituted by one byte only for characters in the ASCII range.

What I'd recommend you to do is to use some heuristic to determine if the file is encoded in UTF-8 (the BOM is a good way if you're sure it'll be present) or Windows-1252 or whatever and then convert it to UTF-8 if it isn't. In that case, you wouldn't need to replace any characters, you could preserve the smart quotes.

Artefacto
The problem was actually that the mb_convert($string,"UTF-8"); actually screws up the syntax if you are passing it a UTF-8 string. It can't convert UTF-8 to UTF-8 without horrible results.
Navarr