I'm going crazy over these encoding problems...

I use json_decode and json_encode to store and retrieve data. What I found out is that JSON always needs UTF-8. No problem there. I give json_encode 'hellö' in UTF-8, and in my DB it looks like hellu00f6. OK, a codepoint. But when I use json_decode, it won't decode the codepoint back, so I still have hellu00f6. Also, in PHP 5.2.13 it seems there are still no optional flags for JSON. How can I convert the codepoint characters back to the correct special characters for display in the browser?

Greetz and thanks

Maenny

A: 

It could be because of the backslash preceding the codepoint in the JSON unicode string: ö is represented as \u00f6. When it is stored in your DB, the DBMS doesn't know how to interpret \u00f6, so I guess it reads (and stores) it as u00f6.

Are you using an escaping function?

Try adding a backslash on unicode-escaped chars:

$json = str_replace("\\u", "\\\\u", $json);
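
The effect can be reproduced without a database. The following is only a minimal sketch: the variable names are made up for illustration, and removing the backslash by hand merely imitates what was observed in the DB, not how the DBMS actually processes the query.

```php
<?php
// 'hellö' written as explicit UTF-8 bytes, so the file encoding does not matter.
$original = "hell\xc3\xb6";

// json_encode() escapes non-ASCII characters as \uXXXX by default.
$json = json_encode($original);

// With the backslash intact, json_decode() restores the original character.
$decoded = json_decode($json);

// Imitate what ended up in the DB: the backslash got lost on the way.
$damaged = str_replace('\\', '', $json);
$decodedDamaged = json_decode($damaged); // plain ASCII text, no umlaut

// The workaround from this answer: double the backslash before an unescaped insert.
$forDb = str_replace("\\u", "\\\\u", $json);

echo $json, "\n", $decoded, "\n", $decodedDamaged, "\n", $forDb, "\n";
```

So json_decode itself handles the codepoint fine; the text only stays as u00f6 when the backslash is already missing from the stored string.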
streetpc
I get this error message: Warning: preg_replace() [function.preg-replace]: Unknown modifier 'g'
Maenny
Anyway, you are right: the query used to write to the DB has the backslashes, the string in the DB does not. They get lost somewhere in between...
Maenny
I removed the g; it's a bad habit I picked up from JavaScript.
streetpc
You might want to check whether apostrophes get through as well. You should use an escaping function or prepared statements.
streetpc
Sorry, but I am really bad with regex, which is why I can't figure out what the problem is. The new error message: preg_replace() [function.preg-replace]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1
Maenny
Try with str_replace. Actually there was no reason to go for a regex.
streetpc
OK, I got it to work by using str_replace('\\', '\\\\', $json) BEFORE writing to the DB. Anyway, this is really strange. Why would I lose one backslash?
Maenny
I guess the DBMS tries to interpret it. That's why prepared statements and escape functions are useful: they avoid this kind of trouble by knowing the DBMS's dialect. If you are using MySQL, the quickest fix is to use http://fr.php.net/manual/en/function.mysql-real-escape-string.php on your strings (it takes care of `\`, but also `\x00`, `\n`, `\r`, `'`, `"` and `\x1a`) and `intval` on integers… A prepared statement does all this for you and provides an abstraction over whatever DBMS you are using (though you need the driver for your DBMS in your PHP config).
streetpc
A: 

The preceding post already explains why your example did not work as expected. However, there are some good coding practices for working with databases that are important for the security of your application (i.e. to prevent SQL injection).

The following example intends to show some of these practices, and assumes PHP 5.2 and MySQL 5.1. (Note that all files and database entries are stored using UTF-8 encoding.)

The database used in this example is called test, and the table was created as follows:

CREATE TABLE `test`.`entries` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`data` VARCHAR( 100 ) NOT NULL
) ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_bin 

(Note that the character set is utf8 and the collation is utf8_bin.)

The following PHP code is used both for adding new entries and for creating JSON:

<?php
$conn = new PDO('mysql:host=localhost;dbname=test','root','xxx');
$conn->exec("SET NAMES 'utf8'"); // Enable UTF-8 charset for db-communication ..

if(isset($_GET['add_entry'])) {
    header('Content-Type: text/plain; charset=UTF-8');
    // Add new DB-Entry:
    $data = $conn->quote($_GET['add_entry']);
    if($conn->exec('INSERT INTO `entries` (`data`) VALUES ('.$data.')')) {
        $id = $conn->lastInsertId();
        echo 'Created entry '.$id.': '.$_GET['add_entry'];
    } else {
        $info = $conn->errorInfo();
        echo 'Unable to create entry: '. $info[2];
    }
} else {
    header('Content-Type: application/json; charset=UTF-8');
    // Output DB-Entries as JSON:
    $entries = array();
    if($res = $conn->query('SELECT * FROM `entries`')) {
        $res->setFetchMode(PDO::FETCH_ASSOC);
        foreach($res as $row) {
            $entries[] = $row;
        }
    }
    echo json_encode($entries);
}
?>

Note the use of the method $conn->quote(..) before passing data to the database. As mentioned in the preceding post, it would be even better to use prepared statements, since they take care of the escaping for you. Thus, it would be better to write:

$prepStmt = $conn->prepare('INSERT INTO `entries` (`data`) VALUES (:data)');
if($prepStmt->execute(array('data'=>$_GET['add_entry']))) {...}

instead of

$data = $conn->quote($_GET['add_entry']);
if($conn->exec('INSERT INTO `entries` (`data`) VALUES ('.$data.')')) {...}
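
To tie this back to the lost backslash, here is a self-contained sketch of the prepared-statement variant. It is an illustration under assumptions, not the setup from the question: an in-memory SQLite database (which requires the pdo_sqlite extension) stands in for the MySQL table so that it runs without a server, and the table is reduced to the two columns used above.

```php
<?php
// Assumption: pdo_sqlite is available; SQLite replaces MySQL for this demo only.
$conn = new PDO('sqlite::memory:');
$conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$conn->exec('CREATE TABLE entries (id INTEGER PRIMARY KEY AUTOINCREMENT, data TEXT NOT NULL)');

// JSON containing a \u escape, i.e. a string with a literal backslash in it.
$json = json_encode("hell\xc3\xb6"); // 'hellö' given as UTF-8 bytes

// The value is bound as a parameter, so no manual escaping or doubling is needed.
$insert = $conn->prepare('INSERT INTO entries (data) VALUES (:data)');
$insert->execute(array(':data' => $json));

// Read the row back through a second prepared statement.
$select = $conn->prepare('SELECT data FROM entries WHERE id = :id');
$select->execute(array(':id' => $conn->lastInsertId()));
$stored = $select->fetchColumn();

echo $stored, "\n";              // the JSON text as it was inserted
echo json_decode($stored), "\n"; // the decoded original string
```

Because the JSON text is never spliced into the SQL string, the backslash reaches the table unchanged and json_decode can restore the character.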

Conclusion: Using UTF-8 for all character data that is stored or transmitted to the user is reasonable. It makes developing internationalized web applications much easier. To make sure user input is sent to the database properly, using an escape function is a good idea. Better still, prepared statements make development even easier and also improve your application's security, since they prevent SQL injection.

Javaguru