Hi all,

My problem is this:

I am copying a set of HTML documents from one machine to another, and I am adding more information to the target documents as an element. The problem is that the source documents use many different encodings [UTF-8, ISO-8859-1, GB2312, etc.], while the meta information is stored as UTF-8. So when I naively merge my meta info into the original document, the meta info [which contains international characters] looks garbled.

So, is there a way to use the encoding declared in the <META> and !DOCTYPE tags for the whole HTML document, except for a TABLE or DIV section that would use another encoding specified there?

Thanks in advance,

Ernesto

+3  A: 

No, there isn't.

I suggest you use DOM parsers to read the various HTML bits into memory, and then construct a combined document in UTF-8. Once these HTML fragments are in memory (after parsing) they'll be in some sort of Unicode representation (depending on the programming language), and so no information should get lost along the way.
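As a rough sketch of that idea in Python: decode each source document using its declared charset, add the meta info while everything is a Unicode string, and write the result out as UTF-8. Note that this uses a plain regex on the `charset=` declaration rather than a real DOM parser, and the function names, the `meta-info` class, and the UTF-8 fallback are all assumptions made for illustration.

```python
import codecs
import html
import re

CHARSET_RE = re.compile(r'(charset\s*=\s*["\']?)([A-Za-z0-9_\-]+)', re.IGNORECASE)

def declared_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named near the top of the document, or a default."""
    head = raw[:2048].decode("ascii", errors="ignore")
    match = CHARSET_RE.search(head)
    if not match:
        return default
    try:
        codecs.lookup(match.group(2))  # make sure Python knows this codec
    except LookupError:
        return default
    return match.group(2)

def merge_as_utf8(raw: bytes, meta_info: str) -> bytes:
    """Decode the document, insert meta_info, and re-encode as UTF-8."""
    text = raw.decode(declared_charset(raw), errors="replace")
    # Rewrite the first charset declaration so it matches the new encoding.
    text = CHARSET_RE.sub(lambda m: m.group(1) + "utf-8", text, count=1)
    block = '<div class="meta-info">' + html.escape(meta_info) + "</div>"
    # Put the block just before </body> if there is one, else append it.
    body_end = re.search(r"</body\s*>", text, re.IGNORECASE)
    if body_end:
        text = text[:body_end.start()] + block + text[body_end.start():]
    else:
        text += block
    return text.encode("utf-8")
```

For example, `merge_as_utf8(open("page.html", "rb").read(), "日本語")` would return UTF-8 bytes in which both the original text and the meta info are intact (`page.html` is a placeholder name). A proper HTML parser would be more robust against unusual markup.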

Assaf Lavie
+1  A: 

No, a document can only have one encoding, so you need a character encoding that can represent every character used across your documents. In your case I suggest using UTF-8 for all of them. Alternatively, use character references instead of the plain characters wherever a character cannot be encoded in the document's encoding.
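If the documents have to stay in their original encodings, the character-reference approach could look roughly like this in Python. The function name is made up for illustration; `xmlcharrefreplace` is the standard error handler that turns unencodable characters into numeric references such as `&#26085;`.

```python
import html

def encode_with_char_refs(meta_info: str, target_encoding: str) -> bytes:
    """Encode meta_info for a document in target_encoding.

    Characters the target encoding cannot represent become numeric
    character references instead of being lost or garbled.
    """
    # Escape &, < and > first so the text is safe to place in HTML.
    escaped = html.escape(meta_info, quote=False)
    return escaped.encode(target_encoding, errors="xmlcharrefreplace")
```

The resulting bytes can be spliced into the original document's bytes without re-encoding the rest of the document, as long as `target_encoding` matches the encoding the document actually uses.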

Gumbo