I am trying to debug a nasty UTF-8 problem and do not know where to start.
A page renders the word categorieën with a garbled character where the ë should be. Clearly something is wrong with the UTF-8 handling. This happens with all multibyte characters. I have scanned the gazillion topics here on UTF-8, but they mostly cover the basics, not this situation where everything appears to be configured correctly but clearly is not.
The pages are served by Drupal, from a MySQL database.
The database was migrated (not by me) by SQL-dumping and importing through phpMyAdmin. There is a good chance something went wrong there, because there was no problem before the migration and because the problem occurs only on older, imported items. Editing these items or inserting new ones, and fixing the wrongly encoded characters by hand, fixes the problem, though I cannot see a difference in the database.
- Content re-edited through Drupal does not have this problem.
- On the CLI, using the MySQL client, I can read out that text and get the correct ë character, both for the articles that render "correct" and for those that render "incorrect" characters.
- The tables have collation utf8_general_ci
- Headers appear to be sent with the correct encoding:
Vary: Accept-Encoding
and Content-Type: text/html; charset=utf-8
- HTML head contains a
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
- The HTTP headers tell me there is a Varnish proxy in between. Could that cause UTF-8 conversion or breakage?
- Content is served gzipped, which is normal in Drupal, and I have never seen this UTF-8 issue with gzipping, but you never know.
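Since the rendered glyphs look identical in the CLI client, comparing raw bytes may be more telling than comparing what the terminal shows. The sketch below is only an illustration under one assumption: that the damage, if any, is classic double encoding (UTF-8 bytes read as Latin-1 and encoded to UTF-8 again). The helper name `classify` is hypothetical, and the input would be raw bytes obtained some other way, for example from MySQL's HEX() on the column.

```python
# Assumption: the import may have double-encoded the text
# (UTF-8 bytes misread as Latin-1, then encoded as UTF-8 again).
CORRECT = "ë".encode("utf-8")                        # bytes c3 ab
DOUBLE = CORRECT.decode("latin-1").encode("utf-8")   # bytes c3 83 c2 ab


def classify(raw: bytes) -> str:
    """Hypothetical helper: report how an 'ë' is stored in raw bytes."""
    # Check the longer, damaged pattern first.
    if DOUBLE in raw:
        return "double-encoded"
    if CORRECT in raw:
        return "utf-8"
    return "no ë found"


if __name__ == "__main__":
    for sample in ("categorieën", "categorieÃ«n"):
        raw = sample.encode("utf-8")
        print(raw.hex(), classify(raw))
```

If an old, imported row and a freshly edited row give different classifications here, that would point at the import rather than at Varnish or gzip.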
It appears the import is the culprit, and I would like to know a) what went wrong, b) why I cannot see a difference in the MySQL CLI client between "wrong" and "correct" characters, and c) how to fix the database, or where to start looking and learning how to fix it.
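For point c), if the imported rows do turn out to be double-encoded, one possible repair is to reverse that single round trip. This is a minimal sketch under that assumption, not a confirmed diagnosis; the function name `repair` is hypothetical, and it uses Python's "latin-1" codec, which may differ from MySQL's latin1 for a few characters such as the euro sign.

```python
def repair(text: str) -> str:
    """Undo one round of UTF-8 -> Latin-1 -> UTF-8 double encoding, if present."""
    try:
        # Map each character back to the single byte it was misread from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible this way: treat the text as already correct.
        return text


if __name__ == "__main__":
    print(repair("categorieÃ«n"))  # damaged input
    print(repair("categorieën"))   # already correct, left unchanged
```

Text that is already correct fails the round trip and is returned untouched, so running this over a mix of old and re-edited items should not damage the good ones. Trying it on a copy of the data first would be prudent.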