ansaurus

Question

What is the best way to parse HTML from a Rich Text Editor in Perl?

Answer 1

+1 A:

Two things:

1) There isn't really a widely used alternative to tidy, because it does the job for most people. Is there some behavior of the command-line tool that is inadequate for you? If you posted an example where it falls short, we could get a better understanding of the problem.

2) Regarding performance, you might consider modifying your wrapper to call open2 on tidy to avoid the disk round-trip:

use IPC::Open2;

# open2 raises an exception on failure instead of returning false,
# so an "or die" here would never fire
my $pid = open2(\*FROM_TIDY, \*TO_TIDY, '/usr/bin/tidy');

# give tidy our html and close the handle to tell it we're done
print(TO_TIDY $html_string);
close(TO_TIDY);

# read in the tidy html
while (<FROM_TIDY>) {
    print;
}
close(FROM_TIDY);

# reap the tidy process so it does not linger as a zombie
waitpid($pid, 0);

bmdhacks 2008-10-30 18:22:20

That's likely to deadlock (with tidy blocking on a write to FROM_TIDY and perl blocking on a write to TO_TIDY).

ysth 2008-10-30 23:42:20

Nope. tidy reads the whole file in before parsing it, then writes the whole file out.

bmdhacks 2008-10-30 23:55:06
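
If you would rather not rely on how the child program buffers its input, one way to rule out the deadlock discussed in the comments above is to write from a forked process while the parent reads. This is only a minimal sketch, not the answer author's code: `filter_through` is a hypothetical helper name, and it assumes a Unix-like system where `fork` is available.

```perl
use strict;
use warnings;
use IPC::Open2;
use POSIX ();

# Hypothetical helper: send $input to @cmd's stdin and return its stdout.
sub filter_through {
    my ($input, @cmd) = @_;

    # open2 raises an exception if the command cannot be started
    my $pid = open2(my $from_child, my $to_child, @cmd);

    # Write from a separate process so the parent is free to read;
    # neither side can then block forever on a full pipe.
    my $writer = fork();
    die "fork failed: $!" unless defined $writer;
    if ($writer == 0) {
        close $from_child;
        print {$to_child} $input;
        close $to_child;          # flushes and signals end of input
        POSIX::_exit(0);          # skip END blocks inherited from the parent
    }

    close $to_child;              # parent's copy must close too, or no EOF
    my $output = do { local $/; <$from_child> };
    close $from_child;

    waitpid($writer, 0);          # reap both children
    waitpid($pid, 0);
    return $output;
}
```

With the wrapper from the answer, the call would look something like `my $tidied = filter_through($html_string, '/usr/bin/tidy');`. The trade-off is one extra process per call, in exchange for not depending on tidy reading all of its input before it writes anything.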
