I have a situation where, after several years of use, we suddenly have some JSON-encoded values that are giving our Perl script fits due to backslashes.

The issues are with accented characters like í and é. An example is Matí encoded as Mat\ud873.

It is unclear what may have changed in the environment. PHP, Perl, and MySQL are involved. The table collation is latin1_swedish_ci and this may have been changed by a co-worker screwing around.

Does this ring any bells for anyone?

+3  A: 

The problem here is internationalization on the JavaScript end, not the collation of your DB table. If you had no such problems before, it's likely that no users were inputting international characters, or the character set of your HTML pages was ISO-8859-1/cp1252 (which would have limited form POST data on the client end). New users or changed HTML headers could have caused this problem to manifest itself, but the issue is really on the side of the Perl script.

JSON strings are double-quoted sets of characters, and encoders typically emit Unicode escape sequences for anything that doesn't fit in 7-bit ASCII. The first 128 ISO-8859-1 characters can be represented as-is, but any extended-ASCII/multi-byte characters will end up as \uXXXX values. For example, the character é (e-acute), which is #233 in ISO-8859-1, will show up as \u00E9 (since é is U+00E9 in Unicode), and the string "résumé" would be stored as "r\u00E9sum\u00E9".
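
A quick way to see this from PHP (a minimal illustration; it assumes the incoming data is UTF-8, since json_encode() expects UTF-8 input):

    <?php
    // json_encode() escapes everything outside 7-bit ASCII as a \uXXXX
    // sequence, so the JSON text itself stays pure ASCII.
    echo json_encode("résumé"), "\n";   // "r\u00e9sum\u00e9"

    // Characters outside the Basic Multilingual Plane come out as two
    // back-to-back \uXXXX escapes (a surrogate pair).
    echo json_encode("𝄞"), "\n";        // "\ud834\udd1e"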

Not knowing what your Perl script is attempting to do, all I can say is that it may be having trouble decoding the escape sequences. Perl has its own set of escape sequences, and \u in the middle of a double-quoted string means "upper-case the next character", so you're probably seeing a lot of "00E9" output from your Perl script instead of the accented characters, or you may get parse errors, depending on your script.

Since you're creating/storing the JSON from POST data in PHP, you have some options (3 and 4 are sketched in code after the list):

  1. Convert the special characters to HTML entities (htmlentities())
  2. Convert the data from UTF-8 (if that's what your POST data arrives as) down to ISO-8859-1 via utf8_decode() (you may lose data with this approach)
  3. Scrub the resultant JSON by replacing every match of the regex /\\u[0-9a-fA-F]{4}/ with "" (nothing) (you may lose data with this approach)
  4. Double-escape the resultant JSON by changing all "\" characters to "\\" before feeding it to your Perl script (be wary of SQL injection!)
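
To make options 3 and 4 concrete, here is a rough PHP sketch; the sample value is illustrative, and your real string would come from the POST data:

    <?php
    // Illustrative value only -- the real JSON is built from $_POST data.
    $json = json_encode(array('name' => 'Matí'));                  // {"name":"Mat\u00ed"}

    // Option 3: scrub every \uXXXX escape out of the JSON (lossy -- the
    // accented characters simply vanish from the stored value).
    $scrubbed = preg_replace('/\\\\u[0-9a-fA-F]{4}/', '', $json);  // {"name":"Mat"}

    // Option 4: double every backslash so the escapes survive one extra
    // round of unescaping before the Perl script sees the string.
    $doubled = str_replace('\\', '\\\\', $json);                   // {"name":"Mat\\u00ed"}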
Jay Dansand
I'm pretty sure this is due to an environmental change. There are 178,000+ individual submissions from all over the world stored, and I discovered the problem because it caused the Perl cron job to throw errors, which it hadn't done in the previous three years. I was hoping to avoid applying a regex to the data, but that may be my best option.
jerrygarciuh
Thank you very much for your time and trouble on your response. It definitely helped me think this through more.
jerrygarciuh
Forcing 8-bit encoding on the PHP end before JSON-ifying the data may solve the issue while mitigating the data loss. Of course, Perl can handle Unicode, so teaching it to convert \uXXXX references internally is the best option, and preserves all of your data.
Jay Dansand
Jay - Ignacio's comment above made me finally understand what Perl was complaining about. I thought it was the escape, but it's an issue with surrogate pairs. Should I delete this question and ask a new one?
jerrygarciuh
Yay! htmlentities() saved the day! I encode when I store and decode when I inflate, and life is beautiful! Thank you!
jerrygarciuh
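
For reference, the encode-on-store / decode-on-inflate round trip described in that last comment would look roughly like this in PHP (the 'name' field and the assumption that the POST data arrives as UTF-8 are illustrative):

    <?php
    // On store: turn accented characters into HTML entities before the value
    // is JSON-encoded, so the stored JSON contains no \uXXXX escapes at all.
    $name = $_POST['name'];                                    // e.g. "Matí"
    $safe = htmlentities($name, ENT_QUOTES, 'UTF-8');          // "Mat&iacute;"
    $json = json_encode(array('name' => $safe));               // {"name":"Mat&iacute;"}

    // On inflate: decode the JSON, then turn the entities back into characters.
    $row  = json_decode($json, true);
    $name = html_entity_decode($row['name'], ENT_QUOTES, 'UTF-8');   // "Matí"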