ansaurus

Question

How can I strip HTML in a string using Perl?

Answer 1

+10 A:

Assuming the code is valid HTML (no stray < or > characters outside of tags), this removes every tag:

$htmlCode =~ s|<.+?>||g;

If you need to remove only the b, h1 and br tags:

$htmlCode =~ s#</?(?:b|h1|br)\b.*?>##g;

You might also want to consider the HTML::Strip module.
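For reference, a minimal sketch of the HTML::Strip route, assuming the module is installed from CPAN. The sample markup is made up for illustration, and the exact whitespace of the result depends on the module's defaults, so no output is shown.

```perl
use strict;
use warnings;
use HTML::Strip;

# Made-up sample input, for illustration only.
my $html = '<h1>Title</h1><p>Some <b>bold</b> text.<br>Next line.</p>';

my $hs   = HTML::Strip->new();
my $text = $hs->parse($html);   # returns the text with the markup removed
$hs->eof;                       # reset parser state before reusing $hs

print "$text\n";
```

Note that this strips markup; it is not by itself a sanitizer for untrusted input.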

abhinavg 2009-07-01 05:31:04

I wouldn't say might, I would say should. Attempting to sanitize HTML with regexes is absurd in this day and age. Use one of the numerous HTML sanitizing modules from the CPAN, preferably something designed to prevent XSS vulnerabilities and not written by Daniel Muey.

nothingmuch 2009-07-02 10:52:04

Answer 2

+7 A:

From perlfaq9: How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText, which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.
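A rough sketch of the HTML::Parser approach, assuming the module is installed from CPAN. The helper name strip_html and the sample input are hypothetical; the idea is simply to collect text events and ignore everything else.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: collect text nodes and drop all markup.
sub strip_html {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with entities such as &lt; decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    # Skip the contents of these elements entirely.
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;    # flush any buffered text
    return $text;
}

# Made-up sample input with an entity and a quoted ">" inside an attribute.
print strip_html(qq{<p>1 &lt; 2 <img src="foo.gif" alt="A > B"></p>\n});
```

Since no comment handler is registered, HTML comments are dropped along with the tags.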

Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities--like &lt; for example.
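To make two of those failure modes concrete, here is a small core-Perl sketch. The helper name naive_strip is hypothetical; it just wraps the simple-minded substitution.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the simple-minded substitution.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

my $quoted    = '<IMG SRC = "foo.gif" ALT = "A > B">';
my $multiline = qq{<IMG SRC = "foo.gif"\n ALT = "A > B">};

# The match ends at the ">" inside the ALT value, leaving a fragment behind.
print "quoted:    [", naive_strip($quoted), "]\n";   # prints: quoted:    [ B">]

# Without /s, "." does not match a newline, so this tag is not matched at all
# and the string comes back unchanged.
print "multiline: [", naive_strip($multiline), "]\n";
```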

Here's one "simple-minded" approach, that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking a solution:

<IMG SRC = "foo.gif" ALT = "A > B">

<IMG SRC = "foo.gif"
 ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<# Just data #>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

<!-- This section commented out.
 <B>You can't see me!</B>
-->
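As an illustration of that last case, the sketch below runs the tag regex from above over the commented-out block, once on its own and once after a separate pass that removes comments. The helper name strip_tags and the comments_first option are hypothetical, and the comment pass assumes comments are plain <!-- ... --> runs; this is still a regex sketch, not a parser.

```perl
use strict;
use warnings;

# Hypothetical helper: the tag regex, optionally preceded by a comment pass.
sub strip_tags {
    my ($html, %opt) = @_;
    # Assumption: comments are simple <!-- ... --> runs with no nesting.
    $html =~ s/<!--.*?-->//gs if $opt{comments_first};
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my $html = <<'HTML';
<!-- This section commented out.
 <B>You can't see me!</B>
-->
HTML

# Tags only: the hidden text and the trailing "-->" leak into the output.
print "tags only:      [", strip_tags($html), "]\n";

# Comments first: only whitespace remains.
print "comments first: [", strip_tags($html, comments_first => 1), "]\n";
```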

brian d foy 2009-07-01 08:16:54
