Is there a Perl module out there that can take bad HTML (such as what is copied from Microsoft Word) and turn it into nicely formatted HTML? I have looked at HTML::Tidy, but it has gotten horrible reviews on CPAN. We have a custom legacy module that's basically a wrapper for the command-line version of tidy (which seems to be pretty much what HTML::Tidy is), but it writes files to disk and reads them back in, which can be a big performance penalty. Surely, with Perl's awesome text-parsing abilities, there's a better way to do this, right?
A:
Two things:
1) There really isn't an alternative to tidy, since it does the job for most people. Is there some behavior of the command-line tool that is inadequate for you? Perhaps if you presented an example of why it's not up to snuff we could get a better understanding of the problem.
2) Regarding performance, you might consider modifying your wrapper to call open2 (from IPC::Open2) on tidy to avoid the disk round-trip:
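use IPC::Open2;

# open2 raises an exception on failure instead of returning false,
# so an "or die" after it would never fire
my $pid = open2(my $from_tidy, my $to_tidy, '/usr/bin/tidy');

# give tidy our html and close the handle to tell it we're done
print {$to_tidy} $html_string;
close($to_tidy);

# read in the tidy html
while (<$from_tidy>) {
    print;
}
close($from_tidy);

# reap the child so it doesn't linger as a zombie
waitpid($pid, 0);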
bmdhacks
2008-10-30 18:22:20
That's likely to deadlock (with tidy blocking on a write to FROM_TIDY and perl blocking on a write to TO_TIDY).
ysth
2008-10-30 23:42:20
Nope. tidy reads the whole file in before parsing it, then writes the whole file out.
bmdhacks
2008-10-30 23:55:06
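If you'd rather not depend on tidy buffering its whole input before it writes anything, one way to sidestep the deadlock concern raised above is to fork a small writer process, so that writing and reading happen concurrently. The following is only a minimal sketch under that assumption: filter_through is a hypothetical helper name (not part of IPC::Open2), and the /usr/bin/tidy path in the usage comment is the same one assumed in the answer.

```perl
use strict;
use warnings;
use IPC::Open2;
use POSIX ();

# Hypothetical helper: pipe $input through an external command and return
# its output plus exit status. A forked writer feeds the command while the
# parent reads, so neither side can block the other on a full pipe.
sub filter_through {
    my ($cmd, $input) = @_;

    # open2 dies on failure, so there is no return value to check here
    my $pid = open2(my $from, my $to, @$cmd);

    my $writer = fork();
    die "fork failed: $!" unless defined $writer;

    if ($writer == 0) {
        # writer child: send the input, then exit without running
        # END blocks or flushing buffers inherited from the parent
        close($from);
        print {$to} $input;
        close($to);
        POSIX::_exit(0);
    }

    # parent: drop our copy of the write end so the command sees EOF
    # once the writer child is done
    close($to);
    my $output = do { local $/; <$from> };
    $output = '' unless defined $output;
    close($from);

    waitpid($writer, 0);
    waitpid($pid, 0);
    return ($output, $? >> 8);
}

# Usage (assumes tidy lives at /usr/bin/tidy):
#   my ($clean_html, $status) = filter_through(['/usr/bin/tidy'], $html_string);
```

The extra fork costs a little, but it keeps everything in core Perl modules. An IO::Select loop in a single process would be the alternative if forking twice per document is too expensive for you.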