ansaurus

Question

How do I make pQuery work with slightly malformed HTML?

Answer 1

+1 A:

Try HTML::Tidy, which fixes invalid HTML.

lonesomeday 2010-10-09 15:47:25

Sorry, but I need a pure-Perl solution. That has now been clarified in the question. Thanks for the answer anyway! :-)

knorv 2010-10-09 15:53:17

Answer 2

A:

Is this what you want?

$html_malformed =~ s|<+(<.*?>)>+|$1|g;

elektronikLexikon 2010-10-09 16:00:45

No, that would only catch the example given. I'm looking for a more general solution.

knorv 2010-10-09 16:11:40

Answer 3

+3 A:

I'd report this as a bug in pQuery. Here's a workaround:

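use strict;
use warnings;
use HTML::TreeBuilder;
use pQuery;

# The stray trailing ">" makes this HTML slightly malformed.
my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";
# Parse with HTML::TreeBuilder first, then pass the re-serialized HTML to pQuery.
my $html_cleaned = HTML::TreeBuilder->new_from_content($html_malformed);
my $page = pQuery($html_cleaned->as_HTML);
$html_cleaned->delete;    # explicitly free the parse tree
my $title = $page->find("title");
print "The title is: ", $title->html, "\n";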

This doesn't make a lot of sense, since pQuery already uses HTML::TreeBuilder as its underlying parsing mechanism, but it does work.

cjm 2010-10-09 19:27:03
