I'm trying to translate a lot of HTML-encoded text into UTF-8 before putting it into my database. A ton of characters get missed by both html_entity_decode and iconv with //TRANSLIT.

I've written up a long list of characters to strip out, but now I see that &Yuml is not translated, but &yuml is.

I'm sure there are other similar symbols that are missed as well.

Any advice on how best to handle these inconsistencies and make sure each character gets translated correctly?

+1  A: 

Anything that is in the form &blah; is an entity reference in (X)HTML; if you need to be sure you got them all, make sure none of your final UTF-8 output contains that pattern. You'll also find plenty without the semicolon at the end (but many false positives there).
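A minimal sketch of that check in PHP, assuming PHP 7 or later. The helper name `decode_and_audit` is made up for illustration. It also passes 'UTF-8' explicitly as the target charset, on the assumption that a single-byte default charset may be why &Yuml; (U+0178) is skipped while &yuml; (U+00FF) is decoded:

```php
<?php
// Hypothetical helper: decode entities to UTF-8, then list any that remain.
function decode_and_audit(string $html): array
{
    // Ask for UTF-8 output explicitly. An entity whose character cannot be
    // represented in the target charset is left untouched by the decoder.
    $text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Anything still shaped like &name; &#123; or &#x1F; was not decoded.
    preg_match_all(
        '/&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);/',
        $text,
        $matches
    );

    return [
        'text'     => $text,
        'leftover' => array_values(array_unique($matches[0])),
    ];
}
```

Running this over a sample of the data and looking at `leftover` shows which references still need handling. Double-encoded input such as &amp;Yuml; will show up there too, since one decoding pass turns it into &Yuml;.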

Wikipedia, naturally, has a list of HTML/XHTML/XML entity codes. You can implement that (long) list, and see if you find any additional ones in the wild.
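One way to implement such a list, sketched under assumptions: the constant `EXTRA_ENTITIES` and the function `apply_entity_map` are hypothetical names, the three entries are only illustrative, and the code needs PHP 7 or later for the `\u{...}` escapes.

```php
<?php
// Hypothetical supplemental map: entity reference => UTF-8 replacement.
// Only three illustrative entries; a real list would be much longer.
const EXTRA_ENTITIES = [
    '&Yuml;'  => "\u{0178}", // Latin capital Y with diaeresis
    '&OElig;' => "\u{0152}", // Latin capital ligature OE
    '&oelig;' => "\u{0153}", // Latin small ligature oe
];

function apply_entity_map(string $text, array $map = EXTRA_ENTITIES): string
{
    // strtr with an array tries the longest key first and never rescans
    // text it has already replaced, so replacements cannot cascade.
    return strtr($text, $map);
}
```

The map can grow as new undecoded references turn up in the wild.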

derobert
Thanks derobert, I was hoping there was a way to do this without going through such a long list (hoping something already existed). Looks like I'll be writing the cleanup for that, and I'll post the function for those who need it in the future.
pedalpete