+3  A: 

HTML::Parser?

drdaeman
+2  A: 

The HTML generated by Word is relatively easy to deal with. I would just get rid of all the tag attributes (unless you care about styles). That would leave you with fairly plain HTML which you can then style.

HTML::TokeParser::Simple can help make that relatively painless.
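To make the attribute-stripping idea concrete, here is a minimal sketch. The helper name `strip_attributes` and the sample markup are made up for illustration, and it assumes you want to drop every attribute, including `href`, which you may prefer to keep:

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

# Hypothetical helper: return $html with all attributes removed from every
# start tag. Text, end tags, and comments are passed through unchanged.
sub strip_attributes {
    my ($html) = @_;

    my $parser = HTML::TokeParser::Simple->new( string => $html );
    my $out    = '';

    while ( my $token = $parser->get_token ) {
        if ( $token->is_start_tag ) {
            # Copy the attribute names first: delete_attr modifies the
            # underlying list. The '/' entry marks a self-closing tag and
            # is not a real attribute, so leave it alone.
            my @names = grep { $_ ne '/' } @{ $token->get_attrseq };
            $token->delete_attr($_) for @names;
        }
        $out .= $token->as_is;
    }

    return $out;
}

print strip_attributes(
    q{<p class="MsoNormal" style="margin:0">Hello <b lang="EN-US">Word</b></p>}
), "\n";
```

If you do care about some attributes, a whitelist inside the `grep` (for example, skipping `href` on `a` tags) is probably the simplest adjustment.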

As for the other stuff, that will take some trial and error. I am going to think more about that and post later if I can think of something clever.

Later Update:

Well, here is something that makes me cringe a little but it seems to work:

#!/usr/bin/perl

use strict;
use warnings;

use File::Slurp;
use Text::Markdown qw( markdown );

my $html = read_file \*DATA;

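# Turn every <br>, <br/>, or <br /> into a blank line so that Markdown
# treats the surrounding text as separate paragraphs.
$html =~ s{<br\s*/?>}{\n\n}gi;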

print markdown( $html );

__DATA__
This is a section of a blog post. It has <a href="#">links</a> and lists and stuff. Weee....
<br>
<br>
Here's a list
<br/>
<br />
<ul><li>Item 1</li><li>Item 2</li></ul>
And another paragraph here...
<br>
<br/>

Output:

<p>This is a section of a blog post. It has <a href="#">links</a> and lists and
stuff. Weee....</p>

<p>Here's a list</p>

<ul><li>Item 1</li><li>Item 2</li></ul>

<p>And another paragraph here...</p>
Sinan Ünür
+2  A: 

As I said in the other question, I like XML::Twig. It can handle both XML and HTML.

Chas. Owens
Can it take care of the untagged text and convert it to a real `<p>`? That's my main concern...
Andrew
No, for that you need to use the same algorithm Blogger is using.
Chas. Owens
This is why I would use XML::LibXML, which has explicit support for repairing HTML documents.
jrockway
A: 

FWIW, I tend to use XML::LibXML for all my XML and HTML needs. Here is a one-liner that will convert a line of "bad" HTML into a well-formed XHTML document:

perl -MXML::LibXML -ne 'my $p = XML::LibXML->new->parse_html_string($_); print $p->toString'

In your case, you probably want to use the DOM to emit a new document that has the correct tags. This is straightforward; XML::LibXML uses the same W3C DOM that JavaScript does.

As an example, this input:

<p>Foo<p>Bar<br>Baz!

Gets translated into:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Foo</p><p>Bar<br/>Baz!
</p></body></html>

This is probably what you want. Remember to use the DOM to do the translation; don't worry about this printed representation.

jrockway
Cool. Is there any way to not get a full-blown XML/HTML file and just get the `<p>Foo</p><p>Bar<br/>Baz!</p>`, or would I need to traverse the DOM to get just that out?
Andrew
You need to traverse the DOM.
jrockway
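
For reference, the DOM traversal mentioned in the last comment could look roughly like the sketch below. The helper name `repaired_fragment` is made up, and the code assumes the parser wraps the input in `<html><body>` as in the output shown in the answer; it serializes only the children of `<body>`, which skips the XML declaration, the DOCTYPE, and the wrapper elements:

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML;

# Hypothetical helper: parse a possibly malformed HTML fragment and return
# only the repaired markup found inside <body>.
sub repaired_fragment {
    my ($html) = @_;

    my $parser = XML::LibXML->new;
    $parser->recover_silently(1);    # repair bad markup without warnings

    my $doc = $parser->parse_html_string($html);

    # The HTML parser adds <html> and <body> around the fragment.
    my ($body) = $doc->findnodes('/html/body');
    return '' unless $body;

    # Serialize each child of <body> and glue the pieces back together.
    return join '', map { $_->toString } $body->childNodes;
}

print repaired_fragment('<p>Foo<p>Bar<br>Baz!'), "\n";
```

For the sample input this should print something close to the fragment asked about, though the exact serialization (for instance how `<br>` is written) depends on the libxml2 version.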