Here's a quick Perl question:

How can I convert HTML special characters like &uuml; or &#39; to normal ASCII text?

I started with something like this:

s/\&#(\d+);/chr($1)/eg;

and could write it for all HTML characters, but some function like this probably already exists?

Note that I don't need a full HTML-to-text converter. I already parse the HTML with HTML::Parser; I just need to convert the special characters in the text I get back.

+14  A: 

Take a look at HTML::Entities:

use HTML::Entities;

my $html = "Snoopy &amp; Charlie Brown";

print decode_entities($html), "\n";

You can guess the output.
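
Since the question's regex targets numeric entities, here is a small sketch showing that decode_entities also handles the decimal and hex forms, not just named ones. The sample strings are made up for illustration, and the binmode line assumes a UTF-8 terminal.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Assumption: the terminal expects UTF-8 output.
binmode STDOUT, ':encoding(UTF-8)';

# In scalar context decode_entities returns a decoded copy,
# so passing string literals is fine.
my $named   = decode_entities('Snoopy &amp; Charlie Brown');
my $decimal = decode_entities('caf&#233;');    # decimal numeric entity
my $hex     = decode_entities('caf&#xE9;');    # hex numeric entity

print "$named\n";
print "$decimal\n";
print "decimal and hex forms decode to the same string\n"
    if $decimal eq $hex;
```

Note that the result is a Perl character string that may contain non-ASCII characters; reducing it to ASCII is a separate step.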

Telemachus
+1  A: 

There are a handful of predefined HTML entities - &amp; &quot; &gt; and so on - that you could hard-code.

However, the larger case of numeric entities - &#123; - is going to be much harder, as those values are Unicode, and conversion to ASCII is going to range from difficult to impossible.

Bevan
Quite right, Bevan. There's no such thing as a back-translation from Unicode to "plain ASCII". Joel wrote a really good article on text encodings, dehmann should read it...
AmbroseChapel
http://www.joelonsoftware.com/articles/Unicode.html 'All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs.'
AmbroseChapel
Perl: … making the hard things possible
daxim
+4  A: 

Note that there are hex-specified characters too. They look like this: &#xE9; (é).

Use HTML::Entities' decode_entities to translate the entities into actual characters. To convert that to ASCII requires more work. I've used iconv (perl interface: Text::Iconv) with the transliterate option on with some success in the past. But if you are dealing with a limited set of entities, or you don't actually need it reduced to ASCII equivalents, you may be better off limiting what decode_entities produces or providing it with custom conversion maps. See the HTML::Entities doc.
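
One way to read "custom conversion maps" is the _decode_entities function documented in HTML::Entities, which takes a hash of entity names and their replacements. The following is only a sketch: the %ascii_for map and the sample text are hypothetical choices, not a complete transliteration table. Per the module documentation, named entities missing from the hash are left alone, while numeric entities are still expanded to characters.

```perl
use strict;
use warnings;
use HTML::Entities ();

# Hypothetical map: entity name (without '&' and ';') => ASCII replacement.
my %ascii_for = (
    amp   => '&',
    quot  => '"',
    uuml  => 'ue',
    ouml  => 'oe',
    szlig => 'ss',
);

# Made-up sample input; &eacute; is deliberately not in the map.
my $text = 'M&uuml;ller &amp; S&ouml;hne, Stra&szlig;e &eacute;';

# _decode_entities modifies its first argument in place.
HTML::Entities::_decode_entities($text, \%ascii_for);

print "$text\n";
```

Anything the map does not cover stays in the text as an entity, which makes leftovers easy to spot and handle separately.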

ysth
+3  A: 

The above answers tell you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.

Assuming that this is really what you want, and you don't want to keep the Unicode characters, you can look at the Text::Unidecode module from CPAN to zap all those odd characters back into a roughly similar collection of ASCII characters:

use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);

my $source = '&#21271;&#20144;';
print unidecode(decode_entities($source));

# That prints: Bei Jing
Mark Fowler