A website I recently completed with a friend has a gallery where one can upload images and text files. The only accepted text file type (to ease development) is .txt, and that normally goes off without a hitch (or not...).
The problem I've run into is the same one every developer hits: Microsoft's Extended ASCII.
Before outputting the text from the file, I run it through several layers of cleanup:
$txtfile = file_get_contents(".".$this->var['submission']['file_loc']);

// BOM Fun
// Each entry is array(BOM length in bytes, BOM byte sequence).
$boms = array
(
    "utf8"    => array(3,pack("CCC",0xEF,0xBB,0xBF)),
    "utf16be" => array(2,pack("CC",0xFE,0xFF)),
    "utf16le" => array(2,pack("CC",0xFF,0xFE)),
    "utf32be" => array(4,pack("CCCC",0x00,0x00,0xFE,0xFF)),
    "utf32le" => array(4,pack("CCCC",0xFF,0xFE,0x00,0x00)),
    "gb18030" => array(4,pack("CCCC",0x84,0x31,0x95,0x33))
);

// Strip the first BOM that matches the start of the file.
foreach($boms as $bom)
{
    if(mb_substr($txtfile,0,$bom[0]) == $bom[1])
    {
        $txtfile = substr($txtfile,$bom[0]);
        break;
    }
}

$txtfile_o = $txtfile;

// Single-byte codes for the smart quotes (145-148), em-dash (151), and ellipsis (133).
$badwords = array(chr(145),chr(146),chr(147),chr(148),chr(151),chr(133));
$fixwords = array("'","'",'"','"','-','...');
$txtfile_o = str_replace($badwords,$fixwords,$txtfile_o);

$txtfile_o = mb_convert_encoding($txtfile_o,"UTF-8");
The str_replace is the general method of converting Microsoft's awful smart quotes, em-dash, and ellipsis into their normal ASCII equivalents for output.
This code works perfectly fine as long as the uploaded file is ANSI / us-ascii.
This code does not work (for no apparent reason) when the uploaded file is UTF-8.
When the file is UTF-8, viewing the file itself in the web browser works fine, but printing it out via the web interface using this code does not. In that case, the smart quotes become some sort of accented "a" character.
This is where I'm stuck. The output encoding for the webpage is UTF-8, the web browser sees it as UTF-8, the file is in UTF-8 and yet neither the replace for the smart quotes works nor does the web browser display them correctly.
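The symptom can be reproduced in isolation with a sketch like the one below. It is not the site's actual code: the function name cleanText and the sample strings are hypothetical, and it assumes the final conversion treats its input as ISO-8859-1, which is only a guess at what mb_convert_encoding does on the server when no source encoding is passed.

```php
<?php
// Hypothetical helper mirroring the replace-then-convert steps above.
// Assumption: the source encoding is ISO-8859-1; the original call does not
// name one, so the real server may behave differently.
function cleanText(string $raw): string
{
    $badwords = array(chr(145), chr(146), chr(147), chr(148), chr(151), chr(133));
    $fixwords = array("'", "'", '"', '"', '-', '...');
    $text = str_replace($badwords, $fixwords, $raw);
    return mb_convert_encoding($text, "UTF-8", "ISO-8859-1");
}

// ANSI-style input: the right single quote is the single byte 0x92 (chr(146)).
$ansi = "It\x92s";

// UTF-8 input: the same character (U+2019) is the three bytes E2 80 99,
// so none of the single-byte search values match it.
$utf8 = "It\xE2\x80\x99s";

$ansiOut = cleanText($ansi);
$utf8Out = cleanText($utf8);

echo $ansiOut, "\n";          // the quote has been replaced with a plain apostrophe
echo bin2hex($utf8Out), "\n"; // each original byte has been re-encoded separately
```

Under that assumption, the ANSI input comes out clean, while the UTF-8 input keeps its three quote bytes through str_replace and then has each one converted as if it were its own character, the first of which (0xE2) renders as "â".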
Any and all help on this would be greatly appreciated.