views:

2175

answers:

5

Answers provided have all been great, I mentioned in the comments of Alnitak's answer that I would need to go take a look at my CSV Generation script because for whatever reason it wasn't outputting UTF-8.

As was correctly pointed out, it WAS outputting UTF-8 - the problem existed with Ye Olde Microsoft Excel which wasn't picking up the encoding the way I would have liked.

My existing CSV generation looked something like:

// Create file and exit;
$filename = $file."_".date("Y-m-d_H-i",time());
header("Content-type: application/vnd.ms-excel");
header("Content-disposition: csv" . date("Y-m-d") . ".csv");
header( "Content-disposition: filename=".$filename.".csv");
echo $csv_output;

It now looks like:

// Create file and exit;
$filename = $file."_".date("Y-m-d_H-i",time());
header("Content-type: text/csv; charset=ISO-8859-1");
header("Content-disposition: csv" . date("Y-m-d") . ".csv");
header("Content-disposition: filename=".$filename.".csv");

echo iconv('UTF-8', 'ISO-8859-1', $csv_output);

-------------------------------------------------------

ORIGINAL QUESTION

Hi,

I've got a form which collects data, form works ok but I've just noticed that if someone types or uses a '£' symbol, the MySQL DB ends up with '£'.

Not really sure where or how to stop this from happening, code and DB information to follow:

MySQL details

mysql> SHOW COLUMNS FROM fraud_report;
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | mediumint(9) |      | PRI | NULL    | auto_increment |
| crm_number   | varchar(32)  | YES  |     | NULL    |                |
| datacash_ref | varchar(32)  | YES  |     | NULL    |                |
| amount       | varchar(32)  | YES  |     | NULL    |                |
| sales_date   | varchar(32)  | YES  |     | NULL    |                |
| domain       | varchar(32)  | YES  |     | NULL    |                |
| date_added   | datetime     | YES  |     | NULL    |                |
| agent_added  | varchar(32)  | YES  |     | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+
8 rows in set (0.03 sec)

PHP Function

function    processFraudForm($crm_number, $datacash_ref, $amount, $sales_date, $domain, $agent_added) {

    // Insert Data to DB
    $sql    = "INSERT INTO fraud_report (id, crm_number, datacash_ref, amount, sales_date, domain, date_added, agent_added) VALUES (NULL, '$crm_number', '$datacash_ref', '$amount', '$sales_date', '$domain', NOW(), '$agent_added')";
    $result = mysql_query($sql) or die (mysql_error());

    if ($result) {
        $outcome = "<div id=\"success\">Emails sent and database updated.</div>";
    } else {
        $outcome = "<div id=\"error\">Something went wrong!</div>";
    }

    return $outcome;
}

Example DB Entry

+----+------------+--------------+---------+------------+--------------------+---------------------+------------------+
| id | crm_number | datacash_ref | amount  | sales_date | domain             | date_added          | agent_added      |
+----+------------+--------------+---------+------------+--------------------+---------------------+------------------+
| 13 | 100xxxxxxx | 10000000     | £10.93 | 18/12/08   |  blargh.com        | 2008-12-22 10:53:53 | agent.name |
+6  A: 

What you're seeing is UTF-8 encoding - it's a way of storing Unicode characters in a relatively compact format.

The pound symbol has value 0x00a3 in Unicode, but when it's written in UTF-8 that becomes 0xc2 0xa3 and that's what's stored in the database. It seems that your database table is already set to use UTF-8 encoding. This is a good thing!

If you pull the value back out from the database and display it on a UTF-8 compatible terminal (or on a web page that's declared as being UTF-8 encoded) it will look like a normal pound sign again.

Alnitak
Thanks, learn something new everyday, now to figure out how to make my CSV generation script output UTF-8 :)
foxed
if you're outputting CSV and it contains the C2A3 sequence, it already *is* UTF-8 - whatever reads the CSV needs to understand it or be told to expect UTF-8.
Paul Dixon
The pound sign is 0xA3 in an Ansi code page and 0x00A3 in Unicode, which is why the character appeared mostly normal in the DB. But in general, the UTF-8 version of non-ASCII characters doesn't work out so nicely.
Rob Kennedy
indeed - it's a coincidence that the second byte of the UTF-8 encoding of 0xA3 also happens to be 0xA3 !
Alnitak
Not really a coincidence if you look at how the encoding works. Any 8 bit character in decimal range 128-191 has the first two binary digits as 10, which is the same as the UTF-8 lead bits, thus char n in that range is encoded as 0xC2 n
Paul Dixon
foxed
ok, it's not a *real* coincidence, yes anything from 128 - 191 would do that. The coincidence is that the OP happened to see one of that relatively small set of characters where this is true.
Alnitak
+4  A: 

£ is 0xC2 0xA3 which is the UTF-8 encoding for £ symbol - so you're storing it as UTF-8, but presumably viewing it as Latin-1 or something other than UTF-8

It's useful to know how to spot and decode UTF-8 by hand - check the wikipedia page for info on how the encoding works:

  • 0xC2A3 = 110 00010 10 100011
  • The bold parts are the actual "payload", which gives 10100011, which is 0xA3, the pound symbol.
Paul Dixon
Thanks, good answer which helped me understand the why - Alnitak got the Accepted because it helped me understand the how-to-fix! +1upvote
foxed
+1  A: 

You need to serve your HTML in utf-8 encoding (actually everyone needs to do this I think!) Header like:

Content-Type: text/html; charset=UTF-8

Or the equivalent. Double check the details though. Should always be declaring the charset as a browser can default to anything it likes.

A: 

Thanks a lot. I had been suspecting mysql for being currupting the pound symbol. Now all i need to do is wherever the csv record is generated, just use wrap them incov funciton. Though this is a good job, I am happy, at least someone showed exactly what to do. I sincerly appreciate dislaying the previous and the new 'header' values. It was a great help to me.

-mark

A: 
Sunil Rajput
that's worst approach ever.
Col. Shrapnel
Trying to answer something which is already answered around two years ago seems a little, odd.
foxed