I have many HTML documents that contain many HTML entities written as numeric Unicode code points, e.g. &#1576;&#1585;&#1608;&#1581; (which should become بروح).

Is there a good tool to convert HTML entities in multiple HTML documents to plain UTF-8/UTF-16/UTF-32 characters?

I want an offline converter tool that can do a batch job for this purpose.

+3  A: 

I don't know of such a tool, but you could easily write one. This C# code, for example, would convert all HTML files in the current folder:

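using System.IO;
using System.Text.RegularExpressions;

// Decode decimal references (&#NNNN;) in every .html file in the current folder.
foreach (string name in Directory.GetFiles(".", "*.html")) {
  string s = File.ReadAllText(name);
  s = Regex.Replace(
    s,
    @"&#(\d+);",
    // ConvertFromUtf32 also handles code points above U+FFFF,
    // which a plain (char) cast would truncate.
    m => char.ConvertFromUtf32(int.Parse(m.Groups[1].Value))
  );
  File.WriteAllText(name, s); // written as UTF-8 by default
}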
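The same regex-with-callback idea can be sketched in Python. This is only an illustration under a few assumptions: the name `decode_numeric_refs` and the `KEEP_ESCAPED` set are hypothetical, hexadecimal references (`&#x628;`) are handled in addition to decimal ones, and references to markup-significant characters such as `&#60;` are deliberately left alone so the HTML keeps its meaning.

```python
import re

# Matches hexadecimal (&#x628;) and decimal (&#1576;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

# Assumption: these characters should stay escaped, otherwise the markup changes.
KEEP_ESCAPED = {"<", ">", "&", '"', "'"}


def decode_numeric_refs(text: str) -> str:
    """Replace numeric character references with the characters they name."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Surrogates and out-of-range values are not valid characters: keep as-is.
        if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        char = chr(code)
        return match.group(0) if char in KEEP_ESCAPED else char

    return NUMERIC_REF.sub(replace, text)


print(decode_numeric_refs("&#1576;&#1585;&#1608;&#1581;"))  # prints بروح
```

The callback is what makes this work: the replacement depends on the number captured by the pattern, which a fixed replacement string cannot compute.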
Guffa
Oops, I overlooked that a regex could be used for this. For simplicity, I will use PowerGREP instead. Thanks @Guffa.
Vantomex
Well, after more thought, I realize that regex syntax has no decimal notation for Unicode code points. So I'll go in your direction.
Vantomex
A: 

The GNU utility "recode" will do this, with the invocation

recode HTML..UTF-16LE < old.html > new.html

(or UTF-16BE, of course.)

http://ftp.gnu.org/gnu/recode/recode-3.6.tar.gz

Its use of HTML as a character set is a bit of a hack: HTML is treated as either ASCII or Latin-1, when it should be treated as a "surface" over any character set. If the input contains any UTF-8 characters, it can break, so I'm now withdrawing my recommendation. Use the first answer.

(You might expect recode UTF-8..HTML,HTML..UTF-16LE to work, but this first encodes the ampersands...)

wnoise
Will it just convert the encoding, or can it also convert HTML entities to their plain characters?
Vantomex
@Vantomex: This converts both the text _and_ the HTML entities to UTF-16. recode defines this HTML pseudo-character-set specifically to handle HTML entities. Everything besides the entities is handled as ASCII.
wnoise
Yes, it works, thanks.
Vantomex
@Vantomex: Did you want to convert the files to UTF-16 also, not just convert the HTML entities? That's not what you answered in your comment...
Guffa
@Guffa, I didn't ask for conversion from some encoding to UTF-16 because I can do that easily using **HTML Tidy** or **UTFCast**. Actually, your answer had already satisfied me. I would accept both answers if I could, but unfortunately I can't, so I voted yours up instead :-). However, since my question is about a *tool*, I think @wnoise's answer was more relevant.
Vantomex