I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.
Even though my webapp uses UTF-8, some users somehow enter invalid characters. This causes errors in PHP's json_encode(), and in general it seems like a bad idea to keep such data around.
The W3C I18N FAQ "Multilingual Forms" says: "If non-UTF-8 data is received, an error message should be sent back."
- How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
- How do you present the error in a helpful way to the user?
- How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
- For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave it as-is in the database and do something (what?) before json_encode()?
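To make the problem concrete, here is a minimal sketch of the failure and of the conversion I'm unsure about. The sample string and variable names are made up for illustration, and treating the stray byte as ISO-8859-1 is only a guess about where the data came from, not something I can verify for real user input.

```php
<?php
// Hypothetical sample: "\xE9" is "é" in ISO-8859-1, but a lone 0xE9 byte
// is not a valid UTF-8 sequence.
$bad = "caf\xE9";

// Detection: mb_check_encoding() reports whether the bytes are valid UTF-8.
$isValid = mb_check_encoding($bad, 'UTF-8');

// json_encode() returns false on invalid UTF-8 (unless told to throw or
// substitute), and json_last_error() reports JSON_ERROR_UTF8.
$json = json_encode(['name' => $bad]);
$err  = json_last_error();

// Conversion under the ASSUMPTION that the source bytes are ISO-8859-1.
// If that guess is wrong, this silently produces mojibake instead.
$converted = mb_convert_encoding($bad, 'UTF-8', 'ISO-8859-1');

var_dump($isValid, $json, $err === JSON_ERROR_UTF8, $converted);
```

The conversion only "works" when the guess about the source encoding is right, which is why I'm unsure whether writing the result back to the database is safe.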
EDIT: I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP". I'd like advice from people with real-world experience on how they've handled this.
EDIT2: As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.
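For reference, this is the kind of thing I have in mind: a minimal sketch, assuming the mbstring extension is loaded. The function name replace_invalid_utf8() is my own placeholder, and I have not benchmarked it, so I can't claim it is the fast method I'm asking for.

```php
<?php
// Hypothetical helper: replaces invalid UTF-8 bytes with U+FFFD by
// converting the string from UTF-8 to UTF-8 with mbstring.
function replace_invalid_utf8(string $input): string
{
    // Remember the current substitution character so the global
    // mbstring setting can be restored afterwards.
    $previous = mb_substitute_character();

    // Use U+FFFD (REPLACEMENT CHARACTER) for anything that cannot be decoded.
    mb_substitute_character(0xFFFD);

    // Invalid input bytes are replaced by the substitution character;
    // valid sequences pass through unchanged.
    $clean = mb_convert_encoding($input, 'UTF-8', 'UTF-8');

    mb_substitute_character($previous);

    return $clean;
}

// 0xFF can never appear in valid UTF-8, so it is replaced.
$result = replace_invalid_utf8("a\xFFb");
var_dump($result === "a\u{FFFD}b"); // bool(true)
```

On PHP 7.2 and later, mb_scrub() and the JSON_INVALID_UTF8_SUBSTITUTE flag for json_encode() appear to cover similar ground, but I don't know how they compare in speed.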