I need to convert a large website of hand-written static HTML into proper relational data. Each page starts with a number of tables (not necessarily the same ones on every page), followed by markup like this:
<a name=pidgin><font size=4 color=maroon>Pidgin</font><br></a>
<font size=2 color=teal>Author:</font><br>
<font size=2>Sean Egan</font><br>
<font size=2 color=teal>Version:</font><br>
<font size=2>2.6.8</font><br>
<font size=2><a href="http://pidgin.im/"><br>
<img src="images/homepage.jpg"></a>
</font><br>
<br><br><br>
<a name=psi><font size=4 color=maroon>Psi</font><br></a>
<font size=2 color=teal>Version:</font><br>
<font size=2>0.13</font><br>
<font size=2 color=teal>Screenshots:</font><br>
<a href="images/screenshots/psi/1.jpg">
<img src="images/screenshots/psi/1_s.jpg">
</a>
<a href="images/screenshots/psi/2.jpg">
<img src="images/screenshots/psi/2_s.jpg">
</a><br>
<br><br><br>
and then some more tables. I've tried using an HTML parser and selecting a[name] (a CSS selector), but I always lose some entries: because of non-well-formed HTML written by civilians, the parser sometimes decides that entries are nested inside each other instead of forming a flat list. Right now I'm using a set of Vim regexes grouped into a function that transforms this markup into XML, but that isn't a silver bullet either: most of the output files aren't well-formed because stray HTML slips through.
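For reference, the selector-based attempt looks roughly like this (a minimal sketch, assuming Python with BeautifulSoup; the field handling is illustrative, not my exact code):

from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

entries = []
for anchor in soup.select("a[name]"):
    entry = {"id": anchor.get("name"), "fields": []}
    # Collect the <font> label/value elements that follow the anchor,
    # stopping at the next named anchor, which starts a new entry.
    for sib in anchor.find_next_siblings():
        if sib.name == "a" and sib.get("name"):
            break
        if sib.name == "font":
            entry["fields"].append(sib.get_text(strip=True))
    entries.append(entry)

# The trouble: when the parser "repairs" the broken markup by nesting one
# entry inside the previous one, the nested anchor's fields are no longer
# its siblings, so they quietly vanish from the result.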
So I'm wondering: what tools exist for tasks like this?
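For what it's worth, this is the kind of record I'm hoping to end up with for the sample above (the field names are just illustrative, not a fixed schema):

programs = [
    {"id": "pidgin", "name": "Pidgin", "author": "Sean Egan",
     "version": "2.6.8", "homepage": "http://pidgin.im/",
     "screenshots": []},
    {"id": "psi", "name": "Psi",
     "author": None,  # not every entry has every field
     "version": "0.13", "homepage": None,
     "screenshots": ["images/screenshots/psi/1.jpg",
                     "images/screenshots/psi/2.jpg"]},
]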