Our website runs user input through HtmlTidy to clean it. In doing so, it also converts umlauts, which causes problems for our international subscribers. Is there an HtmlTidy option that prevents this?

I tried CharacterEncoding with all possible options, but nothing seems to work.

A: 

Simply provide an output encoding (input encoding is optional) in the configuration file:

input-encoding: win1252
output-encoding: latin1

For an overview of available encodings, look at the output-encoding documentation.
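
As a minimal sketch, assuming the pages are submitted and served as UTF-8 (the question does not state this), Tidy's char-encoding option sets the input and output encoding together in a single line. With a UTF-8 output encoding, umlauts should pass through as raw characters instead of being written as entities:

```
// Assumption: both the incoming form data and the served page are UTF-8
char-encoding: utf8
```

If the two sides really do differ, keep the separate input-encoding and output-encoding lines shown above instead.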

EDIT: So you're using the .NET bindings. The settings are the same:

Document d = new Document(new FileStream("in.html", FileMode.Open));

d.InputCharacterEncoding = EncodingType.Utf8;
d.OutputCharacterEncoding = EncodingType.Win1252;
d.CleanAndRepair();

d.Save("out.html");

With the correct encodings set, you will get the correct result, without entities such as &uuml; in the output.

AndiDog
That did not work. The output came back garbled.
Nikhil Singhal
Are you sure the input encoding is correct? And did you save the output to a file? The (Windows) console font might not be able to show umlauts.
AndiDog
Both the input and the output go through a web page. I am using Mark's .NET version of Tidy.
Nikhil Singhal
Found a bug in Mark's code. We have to set both the input and output encodings separately, and then it works.
Nikhil Singhal