I need to convert a large website from static HTML, written entirely by hand, into proper relational data. Each page starts with a large number of tables (not necessarily the same on every page), followed by code like this:

<a name=pidgin><font size=4 color=maroon>Pidgin</font><br></a>
<font size=2 color=teal>Author:</font><br>
<font size=2>Sean Egan</font><br>
<font size=2 color=teal>Version:</font><br>
<font size=2>2.6.8</font><br>
<font size=2><a href="http://pidgin.im/"><br>
    <img src="images/homepage.jpg"></a>
</font><br>
<br><br><br>

<a name=psi><font size=4 color=maroon>Psi</font><br></a>
<font size=2 color=teal>Version:</font><br>
<font size=2>0.13</font><br>
<font size=2 color=teal>Screenshots:</font><br>
<a href="images/screenshots/psi/1.jpg">
    <img src="images/screenshots/psi/1_s.jpg">
</a>
<a href="images/screenshots/psi/2.jpg">
    <img src="images/screenshots/psi/2_s.jpg">
</a><br>
<br><br><br>

and then some more tables. I've tried using an HTML parser and looking for a[name] (a CSS selector), but some entries always got lost: because the HTML was written by non-programmers and is not well-formed, the parser sometimes decides that entries are nested inside each other instead of forming a flat list. Right now I'm using some Vim regexes, grouped into a function, that transform this code into XML, but this isn't a silver bullet either: most output files aren't well-formed because some HTML slipped through.

So I wonder: which tools exist for tasks like this?

+1  A: 

If you're comfortable with Python, BeautifulSoup was created to solve exactly this problem:

"You didn't write that awful page. You're just trying to get some data out of it."

I've used BeautifulSoup to do this kind of work before, and it's very good.
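
For a flat run of tags like the one in the question, one option is to skip the tree entirely and read the page as a stream of tags. The following is only a minimal sketch under stated assumptions: it uses Python's standard-library `html.parser` rather than BeautifulSoup, the class name `FlatEntryParser` and the dictionary layout are made up for illustration, and it assumes that every `a[name]` starts a new entry, maroon text is the title, teal text is a field label, and the next plain `font` text is that field's value.

```python
from html.parser import HTMLParser


class FlatEntryParser(HTMLParser):
    """Hypothetical helper: split a flat tag stream into entries.

    Nesting is ignored on purpose. Every <a name=...> starts a new entry,
    so misplaced closing tags cannot put one entry inside another.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.entries = []     # entries in page order
        self.current = None   # entry currently being filled
        self.label = None     # last teal label seen, e.g. "Author"
        self.in_font = False  # True while inside a <font> element
        self.color = None     # color attribute of the current <font>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "name" in attrs:
            self.current = {"id": attrs["name"], "title": None,
                            "fields": {}, "links": []}
            self.entries.append(self.current)
            self.label = None
        elif self.current is None:
            return  # still in the leading tables, nothing to collect
        elif tag == "a" and "href" in attrs:
            self.current["links"].append(attrs["href"])
        elif tag == "font":
            self.in_font = True
            self.color = attrs.get("color")

    def handle_endtag(self, tag):
        if tag == "font":
            self.in_font = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self.current is None or not self.in_font:
            return
        if self.color == "maroon":
            self.current["title"] = text
        elif self.color == "teal":
            self.label = text.rstrip(":")
        elif self.label:
            self.current["fields"][self.label] = text
            self.label = None  # one value per label


# Sample input copied from the question.
SAMPLE = """
<a name=pidgin><font size=4 color=maroon>Pidgin</font><br></a>
<font size=2 color=teal>Author:</font><br>
<font size=2>Sean Egan</font><br>
<font size=2 color=teal>Version:</font><br>
<font size=2>2.6.8</font><br>
<font size=2><a href="http://pidgin.im/"><br>
    <img src="images/homepage.jpg"></a>
</font><br>
<br><br><br>

<a name=psi><font size=4 color=maroon>Psi</font><br></a>
<font size=2 color=teal>Version:</font><br>
<font size=2>0.13</font><br>
<font size=2 color=teal>Screenshots:</font><br>
<a href="images/screenshots/psi/1.jpg">
    <img src="images/screenshots/psi/1_s.jpg">
</a>
<a href="images/screenshots/psi/2.jpg">
    <img src="images/screenshots/psi/2_s.jpg">
</a><br>
"""

parser = FlatEntryParser()
parser.feed(SAMPLE)
parser.close()
entries = parser.entries
```

Each item in `entries` can then be written out as a row. The sketch does not detect where the trailing tables begin, so any links after the last entry would be attached to it; a real conversion would need some rule for ending the last entry.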

RichieHindle
Thanks, but I've already tried parsing. It's not that the markup is too awful, but the structure of the original code isn't friendly for that. As you can see in my code example, it is a flat list instead of something nested into divs or tables.
RommeDeSerieux
+2  A: 

The first thing to do would be to run your input HTML through a tool like HTML Tidy to at least ensure it's valid (X)HTML. Then I'd use some kind of DOM-based parsing (rather than regex) to go through the code.
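
If the cleaned-up output nests entries unpredictably, a DOM can still be read in document order, so that the nesting stops mattering. This is a rough sketch only, using the standard-library `xml.etree.ElementTree`. The `TIDIED` string is a hypothetical stand-in for tidied output (not real HTML Tidy output, which would typically also carry an XHTML namespace), and the function name `entries_in_document_order` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tidied markup: well-formed, but the "psi" entry has ended up
# nested inside the "pidgin" anchor.
TIDIED = """<div>
<a name="pidgin"><font size="4" color="maroon">Pidgin</font><br/>
<font size="2" color="teal">Version:</font><br/>
<font size="2">2.6.8</font><br/>
<a name="psi"><font size="4" color="maroon">Psi</font><br/>
<font size="2" color="teal">Version:</font><br/>
<font size="2">0.13</font><br/></a></a>
</div>"""


def entries_in_document_order(root):
    """Group <font> text under the most recent <a name=...>, ignoring depth."""
    entries, label = [], None
    for el in root.iter():  # pre-order traversal, i.e. source order
        text = (el.text or "").strip()
        if el.tag == "a" and el.get("name"):
            entries.append({"id": el.get("name"), "title": None, "fields": {}})
            label = None
        elif el.tag != "font" or not entries or not text:
            continue
        elif el.get("color") == "maroon":
            entries[-1]["title"] = text
        elif el.get("color") == "teal":
            label = text.rstrip(":")
        elif label:
            entries[-1]["fields"][label] = text
            label = None
    return entries


flat = entries_in_document_order(ET.fromstring(TIDIED))
```

Because the loop only tracks "the last anchor seen", it gives the same result whether the second entry is a sibling or a descendant of the first.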

Dan Diplo
Thanks, but HTML Tidy by itself doesn't help: the order of opening and closing tags in the code I need to parse is so messed up that it comes out nested in a different way every time, and that's how it ends up in a DOM parser.
RommeDeSerieux