I have a situation where, after several years of use, we suddenly have some JSON-encoded values that are giving our Perl script fits due to backslashes.

The issues are with accented characters like í and é. An example is Matí encoded as Mat\ud873.

It is unclear what may have changed in the environment. PHP, Perl, and MySQL are involved. The table collation is latin1_swedish_ci and this may have been changed by a co-worker screwing around.

Does this ring any bells for anyone?

+3  A: 

The problem here is internationalization on the JavaScript end, not the collation of your DB table. If you had no such problems before, it's likely that no users were inputting international characters, or the character set of your HTML pages was ISO-8859-1/cp1252 (which would have limited form POST data on the client end). New users or changed HTML headers could have caused this problem to manifest itself, but the issue is really on the side of the Perl script.

JSON strings are double-quoted sets of characters, and encoders typically emit Unicode escape sequences for anything that doesn't fit in 7-bit ASCII. The first 128 ISO-8859-1 characters can be represented as-is, but any extended-ASCII/multi-byte characters will end up as \uXXXX values. For example, the character é (e-acute), which is #233 in ISO-8859-1, will show up as \u00E9 (since é is U+00E9 in Unicode), and the string "résumé" would be stored as "r\u00E9sum\u00E9".
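
A quick way to see this from PHP (a minimal illustration; it assumes the incoming data is UTF-8, since json_encode() expects UTF-8 input):

    <?php
    // json_encode() escapes everything outside 7-bit ASCII as a \uXXXX
    // sequence, so the JSON text itself stays pure ASCII.
    echo json_encode("résumé"), "\n";   // "r\u00e9sum\u00e9"

    // Characters outside the Basic Multilingual Plane come out as two
    // back-to-back \uXXXX escapes (a surrogate pair).
    echo json_encode("𝄞"), "\n";        // "\ud834\udd1e"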

Not knowing what your Perl script is attempting to do, all I can say is that it may be having trouble decoding the escape sequences. Perl has its own set of escape sequences, and \u in the middle of a double-quoted string means "upper-case the next character", so you're probably seeing a lot of "00E9" output from your Perl script instead of the accented characters, or you may get parse errors, depending on your script.

Since you're creating/storing the JSON from POST data in PHP, you have some options (3 and 4 are sketched in code after the list):

  1. Convert the special characters to HTML entities (htmlentities())
  2. Convert the data from UTF-8 (if that's what your POST data arrives as) down to ISO-8859-1 via utf8_decode() (you may lose data with this approach)
  3. Scrub the resultant JSON by replacing every match of the regex /\\u[0-9a-fA-F]{4}/ with "" (nothing) (you may lose data with this approach)
  4. Double-escape the resultant JSON by changing all "\" characters to "\\" before feeding it to your Perl script (be wary of SQL injection!)
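
To make options 3 and 4 concrete, here is a rough PHP sketch; the sample value is illustrative, and your real string would come from the POST data:

    <?php
    // Illustrative value only -- the real JSON is built from $_POST data.
    $json = json_encode(array('name' => 'Matí'));                  // {"name":"Mat\u00ed"}

    // Option 3: scrub every \uXXXX escape out of the JSON (lossy -- the
    // accented characters simply vanish from the stored value).
    $scrubbed = preg_replace('/\\\\u[0-9a-fA-F]{4}/', '', $json);  // {"name":"Mat"}

    // Option 4: double every backslash so the escapes survive one extra
    // round of unescaping before the Perl script sees the string.
    $doubled = str_replace('\\', '\\\\', $json);                   // {"name":"Mat\\u00ed"}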
Jay Dansand
I'm pretty sure this is due to an environmental change. There are 178,000+ individual submissions from all over the world stored, and I discovered the problem because it caused the Perl cron job to throw errors, which it hadn't done in the previous three years. I was hoping to avoid applying a regex to the data, but that may be my best option.
jerrygarciuh
Thank you very much for your time and trouble on your response. It definitely helped me think this through more.
jerrygarciuh
Forcing 8-bit encoding on the PHP end before JSON-ifying the data may solve the issue while mitigating the data loss. Of course, Perl can handle Unicode, so teaching it to convert \uXXXX references internally is the best option, and preserves all of your data.
Jay Dansand
Jay - Ignacio's comment above made me finally understand what Perl was complaining about. I thought it was the escape, but it's an issue with surrogate pairs. Should I delete this question and ask a new one?
jerrygarciuh
Yay! htmlentities() saved the day! I encode when I store and decode when I inflate, and life is beautiful! Thank you!
jerrygarciuh
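
For reference, the encode-on-store / decode-on-inflate round trip described in that last comment would look roughly like this in PHP (the 'name' field and the assumption that the POST data arrives as UTF-8 are illustrative):

    <?php
    // On store: turn accented characters into HTML entities before the value
    // is JSON-encoded, so the stored JSON contains no \uXXXX escapes at all.
    $name = $_POST['name'];                                    // e.g. "Matí"
    $safe = htmlentities($name, ENT_QUOTES, 'UTF-8');          // "Mat&iacute;"
    $json = json_encode(array('name' => $safe));               // {"name":"Mat&iacute;"}

    // On inflate: decode the JSON, then turn the entities back into characters.
    $row  = json_decode($json, true);
    $name = html_entity_decode($row['name'], ENT_QUOTES, 'UTF-8');   // "Matí"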