The HTML generated by Word is relatively easy to deal with. I would just get rid of all the tag attributes (unless you care about styles). That would leave you with fairly plain HTML which you can then style.
HTML::TokeParser::Simple can help make that relatively painless.
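To make the attribute-stripping idea concrete, here is a minimal sketch, assuming the markup is already in a string. The helper name `strip_attributes` and the sample input are made up for illustration. It rebuilds each start tag from its name alone and passes every other token through untouched:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not part of the module): return $html with all
# attributes dropped from start tags; everything else is copied as-is.
sub strip_attributes {
    my ($html) = @_;

    # A reference to a scalar tells the parser to treat it as the document.
    my $parser = HTML::TokeParser::Simple->new( \$html );

    my $out = '';
    while ( my $token = $parser->get_token ) {
        if ( $token->is_start_tag ) {
            # Rebuild the tag from its (lowercased) name only.
            $out .= '<' . $token->get_tag . '>';
        }
        else {
            # Text, end tags, comments, etc. pass through unchanged.
            $out .= $token->as_is;
        }
    }
    return $out;
}

# Made-up sample resembling Word output.
my $sample = '<p class="MsoNormal" style="margin:0">Hello <b style="x">world</b></p>';
print strip_attributes($sample), "\n";
```

If you do want to keep some attributes (say, `href` on links), you would special-case those tags inside the `is_start_tag` branch instead of rebuilding them blindly.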
As for the other stuff, that will take some trial and error. I am going to think more about that and post later if I can think of something clever.
Later Update:
Well, here is something that makes me cringe a little but it seems to work:
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
use Text::Markdown qw( markdown );
my $html = read_file \*DATA;
# Turn every <br>, <br/>, or <br /> into a paragraph break
$html =~ s{<br\s*/?>}{\n\n}gi;
print markdown( $html );
__DATA__
This is a section of a blog post. It has <a href="#">links</a> and lists and stuff. Weee....
<br>
<br>
Here's a list
<br/>
<br />
<ul><li>Item 1</li><li>Item 2</li></ul>
And another paragraph here...
<br>
<br/>
Output:
<p>This is a section of a blog post. It has <a href="#">links</a> and lists and
stuff. Weee....</p>
<p>Here's a list</p>
<ul><li>Item 1</li><li>Item 2</li></ul>
<p>And another paragraph here...</p>
As I said in the other question, I like XML::Twig. It can handle both XML and HTML.
FWIW, I tend to use XML::LibXML for all my XML and HTML needs. Here is a one-liner that will convert a line of "bad" HTML into a well-formed XHTML document:
perl -MXML::LibXML -ne 'my $p = XML::LibXML->new->parse_html_string($_); print $p->toString'
In your case, you probably want to use the DOM to emit a new document that has the correct tags. This is straightforward; XML::LibXML uses the same W3C DOM that JavaScript does.
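A rough sketch of what working through the DOM could look like. The tag mapping (`b` to `strong`, `i` to `em`) and the helper name `translate_tags` are assumptions chosen for illustration, not something from your question, and for brevity this renames nodes in place rather than building a separate output document:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Illustrative mapping only (an assumption): old tag name => new tag name.
my %rename = ( b => 'strong', i => 'em' );

# Hypothetical helper: parse messy HTML, rename tags through the DOM,
# and return the serialized children of <body>.
sub translate_tags {
    my ($html) = @_;

    my $parser = XML::LibXML->new;
    $parser->recover_silently(1);    # tolerate tag soup without warnings
    my $doc = $parser->parse_html_string($html);

    for my $old ( keys %rename ) {
        # findnodes returns a list up front, so renaming inside the loop is safe.
        $_->setNodeName( $rename{$old} ) for $doc->findnodes("//$old");
    }

    my ($body) = $doc->findnodes('/html/body');
    return join '', map { $_->toString } $body->childNodes;
}

print translate_tags('<p>Foo <b>bold</b><p>Bar <i>italic</i>'), "\n";
```

The same pattern extends to anything you can select with XPath: find the nodes, then rename, remove, or re-parent them with the usual DOM methods.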
As an example, this input:
<p>Foo<p>Bar<br>Baz!
Gets translated into:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Foo</p><p>Bar<br/>Baz!
</p></body></html>
This is probably what you want. Just remember to do the translation through the DOM; the printed representation above is only there to show what the parser recovered.