ansaurus

Question

How can I parse and normalize HTML from different HTML generators?

Answer 1

+3 A:

HTML::Parser?
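A rough sketch of what that could look like, assuming the version 3 event API of HTML::Parser. The helper name count_tags is made up for illustration; it only tallies start tags, which is a quick way to see what markup each generator actually emits before deciding how to normalize it.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns a hash ref mapping tag name => number of
# start tags seen. HTML::Parser lowercases tag names by default.
sub count_tags {
    my ($html) = @_;
    my %count;
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { $count{ $_[0] }++ }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;    # signal end of document so buffered input is flushed
    return \%count;
}
```

Calling count_tags on a sample from each generator shows which tags you have to handle.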

drdaeman 2009-06-11 16:44:47

Answer 2

+2 A:

The HTML generated by Word is relatively easy to deal with. I would just get rid of all the tag attributes (unless you care about styles). That would leave you with fairly plain HTML, which you can then style.

HTML::TokeParser::Simple can help make that relatively painless.
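A minimal sketch of the attribute stripping, assuming HTML::TokeParser::Simple's string constructor and its is_start_tag, get_tag and as_is token methods. The function name strip_attributes is hypothetical, and real Word output will likely need extra cleanup (namespaced tags, conditional comments) that this does not attempt.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: re-emit the document token by token, rebuilding
# every start tag from its name alone so all attributes are dropped.
sub strip_attributes {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new( string => $html );
    my $out    = '';
    while ( my $token = $parser->get_token ) {
        if ( $token->is_start_tag ) {
            $out .= '<' . $token->get_tag . '>';
        }
        else {
            # Text, end tags, comments, etc. pass through unchanged.
            $out .= $token->as_is;
        }
    }
    return $out;
}

print strip_attributes('<p class="MsoNormal" style="margin:0">Hi <b>there</b></p>'), "\n";
```

Note that this also drops attributes you may want to keep, such as href on links, so a whitelist would be a sensible refinement.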

As for the other stuff, that will take some trial and error. I am going to think more about that and post later if I can think of something clever.

Later Update:

Well, here is something that makes me cringe a little but it seems to work:

#!/usr/bin/perl

use strict;
use warnings;

use File::Slurp;
use Text::Markdown qw( markdown );

my $html = read_file \*DATA;

# Turn each <br>, <br/>, or <br /> into a paragraph break for Markdown.
$html =~ s{<br(?: ?/)?>}{\n\n}g;

print markdown( $html );

__DATA__
This is a section of a blog post. It has <a href="#">links</a> and lists and stuff. Weee....
<br>
<br>
Here's a list
<br/>
<br />
<ul><li>Item 1</li><li>Item 2</li></ul>
And another paragraph here...
<br>
<br/>

Output:

<p>This is a section of a blog post. It has <a href="#">links</a> and lists and
stuff. Weee....</p>

<p>Here's a list</p>

<ul><li>Item 1</li><li>Item 2</li></ul>

<p>And another paragraph here...</p>

Sinan Ünür 2009-06-11 17:59:58

Answer 3

+2 A:

As I said in the other question, I like XML::Twig. It can handle both XML and HTML.

Chas. Owens 2009-06-11 19:44:07

Can it take care of the untagged text and convert it to a real `<p>`? That's my main concern...

Andrew 2009-06-11 20:44:03

No, for that you need to use the same algorithm Blogger is using.

Chas. Owens 2009-06-11 23:08:49

This is why I would use XML::LibXML, which has explicit support for repairing HTML documents.

jrockway 2009-06-12 04:54:58

Answer 4

A:

FWIW, I tend to use XML::LibXML for all my XML and HTML needs. Here is a one-liner that will convert a line of "bad" HTML into a well-formed XHTML document:

perl -MXML::LibXML -ne 'my $p = XML::LibXML->new->parse_html_string($_); print $p->toString'

In your case, you probably want to use the DOM to emit a new document that has the correct tags. This is straightforward; XML::LibXML uses the same W3C DOM that JavaScript does.

As an example, this input:

<p>Foo<p>Bar<br>Baz!

Gets translated into:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Foo</p><p>Bar<br/>Baz!
</p></body></html>

This is probably what you want, and remember, use the DOM to translate... don't worry about this printed representation.

jrockway 2009-06-12 04:53:50

Cool. Is there any way to not get a full-blown XML/HTML file and just get the FooBar Baz!, or would I need to traverse the DOM to get just that out?

Andrew 2009-06-12 05:35:25

You need to traverse the DOM.
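A minimal sketch of that traversal, assuming XML::LibXML's load_html constructor with the recover option set to 2 (recover silently). The helper name body_fragment is hypothetical; it serializes only the children of the body element, so the XML declaration, DOCTYPE and html/body wrappers are left out.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse possibly broken HTML and return just the
# repaired markup found inside <body>.
sub body_fragment {
    my ($html) = @_;
    my $doc = XML::LibXML->load_html( string => $html, recover => 2 );

    # Select every child node of <body> (elements and text alike) and
    # serialize each one in document order.
    return join '', map { $_->toString } $doc->findnodes('/html/body/node()');
}

print body_fragment('<p>Foo<p>Bar<br>Baz!'), "\n";
```

The same loop is the natural place to rewrite nodes, for example wrapping stray text nodes in new p elements before serializing.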

jrockway 2009-06-12 12:23:55
