I have a (WordPress) blog and some of my older posts have a character encoding problem where £ displays as Â£ (i.e. a pound sign prepended with a capital 'A' with a hat on).

The problem is at the DB level, so I was going to run the following SQL statement:

update wp_posts set post_content = replace(post_content, 'Â£', '£');

Would this be foolish?


Background info (optional reading):

How did this problem happen? I don't know. The blog has been through various updates (including from WordPress Version 2.1.3, when the default table CHARSET changed from latin1 to utf8) and has been migrated to and from various machines. I guess that at some point WordPress must have written UTF-8 encoded characters into a database that had a CHARSET of latin1, or vice versa. I know I should have been more careful (yes, I have read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)).
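For illustration, here is a minimal Python sketch of the suspected mechanism: the UTF-8 bytes of £ being read back as Latin-1. It only shows how the 'Â£' pair can arise; whether this is exactly what happened in this database is an assumption.

```python
# Sketch: UTF-8 bytes of the pound sign, misread one byte at a time as Latin-1.
original = "£"
utf8_bytes = original.encode("utf-8")     # two bytes: 0xC2 0xA3
garbled = utf8_bytes.decode("latin-1")    # 0xC2 -> 'Â', 0xA3 -> '£'

# Reversing the same two steps recovers the original character.
repaired = garbled.encode("latin-1").decode("utf-8")

print(utf8_bytes, garbled, repaired)
```

The extra 'Â' is the first byte of the two-byte UTF-8 sequence, which is why the damage shows up as a fixed two-character pattern that a plain string replace can target.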

How have I ensured that this doesn't happen again? I have made sure my encodings are consistent. All MySQL tables use CHARSET utf8 and the HEAD section of blog pages sets <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />.

+2  A: 

It should be OK. The safest approach is the following:

  • Make a dump of your blog db
  • Load it to another db
  • Perform the replace on the temporary db
  • Check!
  • If all goes well, perform it on the production db as well.
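The steps above can be sketched as a dry run. This is only an illustration: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the temporary MySQL copy, and the function name dry_run_replace and the sample rows are hypothetical.

```python
import sqlite3

BAD, GOOD = "Â£", "£"

def dry_run_replace(rows):
    """Run the replacement on a throwaway copy and report what changed."""
    conn = sqlite3.connect(":memory:")  # stand-in for the temporary db
    try:
        conn.execute(
            "CREATE TABLE wp_posts (id INTEGER PRIMARY KEY, post_content TEXT)"
        )
        conn.executemany(
            "INSERT INTO wp_posts (id, post_content) VALUES (?, ?)", rows
        )
        # Record which posts contain the bad sequence before touching anything.
        affected = [r[0] for r in conn.execute(
            "SELECT id FROM wp_posts WHERE instr(post_content, ?) > 0 ORDER BY id",
            (BAD,),
        )]
        # The same replace as in the question, with bound parameters.
        conn.execute(
            "UPDATE wp_posts SET post_content = replace(post_content, ?, ?)",
            (BAD, GOOD),
        )
        # Check: no post should still contain the bad sequence.
        remaining = conn.execute(
            "SELECT COUNT(*) FROM wp_posts WHERE instr(post_content, ?) > 0",
            (BAD,),
        ).fetchone()[0]
        contents = dict(conn.execute("SELECT id, post_content FROM wp_posts"))
        return affected, remaining, contents
    finally:
        conn.close()

sample = [(1, "Costs Â£5"), (2, "No issue here"), (3, "Â£1 and Â£2")]
affected, remaining, contents = dry_run_replace(sample)
print(affected, remaining, contents)
```

Listing the affected post ids before the update gives a concrete set of posts to inspect by eye in the "Check!" step.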
David Rabinowitz
A: 

Don't do that! Use a trigger on update/insert if you really need to.

EDIT: hmm, after reading your situation, I would suggest making a backup copy of the DB and trying what you said. I think it would work, as long as you're not planning to ever do it again (which seems to be the case)

rmn
+2  A: 

Well, I would say that it would probably be the best "solution" to the problem.

As the data has been stored using the wrong encoding somewhere along the line, the original data is lost and there is no real solution. You just have to try to salvage what you can from the corrupt data that you have.

If the damage is isolated to a single character, you are lucky. Some byte codes may not have translated into any available character. Where that happened, there is no identifiable character combination left to search for: the original character has simply been replaced by another one or dropped. Those cases can only be spotted manually.
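A small Python sketch of that lossy case, assuming the misreading went through Windows-1252 (as implemented by Python's cp1252 codec) with unmappable bytes replaced. The helper names misread and try_repair are hypothetical, and the actual conversion path in this database is unknown.

```python
def misread(text):
    """Encode as UTF-8, then decode as Windows-1252, replacing unmappable bytes."""
    return text.encode("utf-8").decode("cp1252", errors="replace")

def try_repair(garbled):
    """Reverse the misreading; return None when information was lost."""
    try:
        return garbled.encode("cp1252").decode("utf-8")
    except UnicodeError:
        return None

garbled_pound = misread("£")  # bytes C2 A3: both map to characters
garbled_a = misread("Á")      # bytes C3 81: 0x81 has no cp1252 character
garbled_y = misread("Ý")      # bytes C3 9D: 0x9D has no cp1252 character

print(repr(garbled_pound), repr(garbled_a), repr(garbled_y))
print(try_repair(garbled_pound), try_repair(garbled_a))
```

Here 'Á' and 'Ý' collapse to the same garbled text, so nothing in the stored data says which one was originally written, whereas the pound sign survives as a reversible pair.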

Guffa