pQuery is a pragmatic port of the jQuery JavaScript framework to Perl which can be used for screen scraping.
pQuery is quite sensitive to malformed HTML. Consider the following example:
use pQuery;
my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";
my $page = pQuery($html_malformed);
my $title = $page->find("title");
print "The title is: ", $title->html, "\n";
pQuery won't find the title tag in the example above due to the double ">>" in the malformed HTML.
To make my pQuery-based applications more tolerant of malformed HTML, I need to pre-process the HTML by cleaning it up before passing it to pQuery.
Starting with the code fragment given above, what is the most robust pure-Perl way to clean up the HTML so that pQuery can parse it?
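
To illustrate the kind of pre-processing step meant here, the following is a minimal sketch that uses only core Perl. The helper name strip_stray_gt is hypothetical, and it assumes the only defect is one or more stray ">" characters directly after a tag. It is not a general HTML repair routine: it would also drop a legitimate literal ">" that immediately follows a tag in text content, and it does not handle ">" inside quoted attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of pQuery): removes stray '>' characters
# that directly follow a tag. It does not repair any other kind of
# malformed markup.
sub strip_stray_gt {
    my ($html) = @_;
    # Match a complete tag, then one or more extra '>' characters,
    # and keep only the tag itself.
    $html =~ s/(<[^<>]+>)>+/$1/g;
    return $html;
}

my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";
my $html_cleaned   = strip_stray_gt($html_malformed);

# Prints: <html><head><title>foo</title></head><body>bar</body></html>
print $html_cleaned, "\n";
```

The cleaned string could then be passed to pQuery($html_cleaned) in place of $html_malformed. A narrow regex like this only covers the one defect shown, which is why a more robust approach is being asked for.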