Let's say we have an HTML string like "2 &lt; 4".

How can I determine whether it contains any of these extended sequences (entities)?

I've found HTML::Entities on CPAN, but it doesn't provide a 'check' method.

Details: I'm fixing a 'truncate' method so that it doesn't leave a corrupted string like "2 &l" and doesn't do unnecessary work. It should look like this:

$s = HTML::Entities::decode_entities ($s) if $has_ext_chars;
$s = substr ($s, 0, $len - 3) . '...' if length $s > $len;
$s = HTML::Entities::encode_entities ($s, "‚„-‰‹‘-™›\xA0¤¦§©«-®°-±µ-·»") if $has_ext_chars;

How do I determine $has_ext_chars?

A: 

You can try a regular expression:

$str =~ /&\S+;/
krico
A: 

A complete list of character entities can be found on the W3C reference.

You also have to match the numeric forms, \&#\d+; and \&#x[a-fA-F0-9]+;
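
A minimal sketch that folds the named and numeric forms into one check. The sub name `has_entity` is made up for illustration, and the pattern only tests the shape of a reference: it does not verify that a name is actually defined.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $str contains something shaped like a
# character reference (named, decimal, or hexadecimal).
sub has_entity {
    my ($str) = @_;
    return $str =~ /&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);/ ? 1 : 0;
}

print has_entity('2 &lt; 4')     ? "yes\n" : "no\n";   # named reference
print has_entity('2 &#60; 4')    ? "yes\n" : "no\n";   # decimal reference
print has_entity('fish & chips') ? "yes\n" : "no\n";   # bare ampersand only
```

A truncated string such as "2 &l" has no closing semicolon, so it is not reported as containing an entity.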

Benoit
A: 

From perldoc HTML::Entities:

The module can also export the %char2entity and the %entity2char hashes, which contain the mapping from all characters to the corresponding entities (and vice versa, respectively).

You can probably use them to build regexes. For example, to match entities:

use HTML::Entities '%entity2char';

my $regex = "&(?:" . join("|", map {s/;\z//; $_} keys %entity2char) . ");";

if ($str =~ /$regex/) {
    print "$str contains entities\n";
}

This will skip entities like &#entity_number; though.

eugene y
It's pretty easy to add the numeric entities using `. '|#[0-9]+' . '|#x[0-9a-fA-F]+'`.
larsmans
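
Putting the answer and the comment together, one possible sketch looks like this. The names `$entity_re` and `has_known_entity` are invented for illustration, and the code assumes HTML::Entities is installed.

```perl
use strict;
use warnings;
use HTML::Entities '%entity2char';

# Some keys of %entity2char carry a trailing ';' and some do not, so
# normalise them before building the alternation.
my %names = map { (my $n = $_) =~ s/;\z//; ($n => 1) } keys %entity2char;
my $named = join '|', map { quotemeta($_) } sort keys %names;

# Known entity names plus decimal and hexadecimal numeric references.
my $entity_re = qr/&(?:$named|#[0-9]+|#[xX][0-9a-fA-F]+);/;

# Hypothetical helper wrapping the match.
sub has_known_entity {
    my ($str) = @_;
    return $str =~ $entity_re ? 1 : 0;
}
```

Unlike a purely shape-based pattern, this only reports named entities that HTML::Entities knows about, so a sequence with an unrecognised name is ignored.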
A: 

Judging from your code sample, you have probably just introduced a cross-site scripting vulnerability into your application. If I got your code to process something like &lt;script src="evil.example.com"&gt;&lt;/script&gt;, it would decode that to valid HTML and not re-encode the < and > back to entities. (The angle brackets in the character list in your code are not ASCII angle brackets.)

If you are truncating a string that contains any HTML tags or entities, you will probably break something if you use a simple solution. You might be better off building a solution based on an HTML parsing module. If you are only looking at text inside an element with no elements inside it, you can grab the text, truncate it, and then put it back into the element. If you have to deal with mixed content, it will be more complicated.
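
For the text-only case, a minimal sketch of "grab the text, truncate it, re-encode it" might look like this. The sub name `truncate_text` is hypothetical; it assumes the input contains no tags, that `$len` is greater than 3, and that HTML::Entities is installed. Calling `encode_entities` without a character list uses its default set, which includes `<`, `>`, `&` and the quote characters.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);

# Hypothetical helper: $html is entity-encoded text without tags;
# $len is the maximum length counted in decoded characters.
sub truncate_text {
    my ($html, $len) = @_;
    my $text = decode_entities($html);    # work on real characters
    $text = substr($text, 0, $len - 3) . '...' if length($text) > $len;
    return encode_entities($text);        # default set re-encodes <, >, & and quotes
}

print truncate_text('2 &lt; 4', 10), "\n";
print truncate_text('&lt;script&gt;alert(1)&lt;/script&gt;', 10), "\n";
```

Because the markup-significant characters are always re-encoded, a decoded `<script>` never reaches the output as a live tag.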

But in the interest of bad solutions:

# Treats each entity as one character: "2 &lt; 4" is 5 characters long.
$trunc_len = $len - 3;
$str =~ s/^((?>(?:[^&]|&[^\s;]+;?){$trunc_len}))(?:[^&]|&[^\s;]+;?){4,}/$1.../;

# Abuses the procedural nature of the regexp engine.
# Treats each input character as one character: "2 &lt; 4" is 8 characters long.
$str =~ s/^( (?:[^&]|&[^\s;]+;?)+ )(?(?{ $found = (pos() > ( $found ? $len - 3 : $len ))})(?!)).*$(?(?{pos() < $len })(?!))/$1.../x;

Both are fairly permissive in what is an entity to allow for common browser quirks.
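
For comparison, here is a more readable sketch of the first approach, where each entity counts as one character. The sub name `truncate_encoded` is hypothetical. It assumes `$len` is greater than 3 and that the input contains no tags, and it is stricter than the regexes above in that it requires the closing semicolon.

```perl
use strict;
use warnings;

# Hypothetical helper: truncate entity-encoded text to at most $len
# units, where a unit is one entity-like sequence or one other character.
sub truncate_encoded {
    my ($html, $len) = @_;
    # Try an "&...;" sequence first, then fall back to any single character.
    my @units = $html =~ /&[^\s;&]+;|./gs;
    return $html if @units <= $len;
    # Keep $len - 3 whole units and append the ellipsis.
    return join('', @units[0 .. $len - 4]) . '...';
}

print truncate_encoded('2 &lt; 4', 5), "\n";            # fits, returned unchanged
print truncate_encoded('a &amp; b &amp; c', 6), "\n";   # cut on a unit boundary
```

Since the cut always falls between units, the result never ends in a half-written entity such as "&l".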

Ven'Tatsu