ansaurus

Question

Answer 1

+1 A:

If you're comfortable with Python, BeautifulSoup was created to solve exactly this problem:

"You didn't write that awful page. You're just trying to get some data out of it."

I've used BeautifulSoup to do this kind of work before, and it's very good.
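For a flat layout (headings followed by loose paragraphs rather than nested containers), one option is to walk each heading's siblings until the next heading. This is only a rough sketch: the sample markup, the `h3` heading tag and the `group_flat_sections` helper are assumptions made up for illustration, not taken from the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical hand-written markup: flat, mixed-case tags, unquoted attribute.
SAMPLE_HTML = """
<H3>Apples</H3>
<P class=note>Red and crisp</P>
<p>Price: 1.20</p>
<h3>Pears</h3>
<p>Price: 0.95</p>
"""


def group_flat_sections(html, heading="h3"):
    """Map each heading's text to the text of the tags that follow it,
    stopping at the next heading of the same kind."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for head in soup.find_all(heading):
        body = []
        # find_next_siblings(True) yields only tags, skipping bare whitespace.
        for sibling in head.find_next_siblings(True):
            if sibling.name == heading:
                break  # the next section starts here
            body.append(sibling.get_text(strip=True))
        sections[head.get_text(strip=True)] = body
    return sections


sections = group_flat_sections(SAMPLE_HTML)
for title, body in sections.items():
    print(title, body)
```

This relies on the headings and paragraphs ending up as siblings in the parsed tree, so it will not help if the parser nests them unpredictably.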

RichieHindle 2009-08-09 19:56:55

Thanks, but I've already tried parsing. It's not that the markup is too awful, but the structure of the original code isn't friendly to that approach. As you can see in my code example, it is a flat list rather than something nested into divs or tables.

RommeDeSerieux 2009-08-09 20:16:53

Answer 2

+2 A:

The first thing to do would be to run your input HTML through a tool like HTML Tidy to at least ensure it's valid (X)HTML. Then I'd use some kind of DOM-based parsing (rather than regex) to go through the code.
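As a rough illustration of the parser-rather-than-regex idea, here is a sketch using Python's standard-library `HTMLParser`. Note the swap: this is an event-based parser, not HTML Tidy and not a true DOM, so it never builds a tree and mismatched closing tags are simply ignored. The sample markup, the `h3` heading tag and the `FlatSectionParser` / `parse_sections` names are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical sloppy input: unclosed tags, a stray </b>, crossed </p></b>.
SAMPLE_HTML = """
<h3>Apples
<p>Red and crisp</b>
<p><b>Price: 1.20</p></b>
<h3>Pears</h3>
<p>Price: 0.95
"""


class FlatSectionParser(HTMLParser):
    """Collect text under each heading, ignoring how tags are nested."""

    def __init__(self, heading="h3"):
        super().__init__(convert_charrefs=True)
        self.heading = heading
        self.sections = []        # list of [title, [text chunks]]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == self.heading:
            self.sections.append(["", []])
            self._in_heading = True
        else:
            # Any other opening tag implicitly ends an unclosed heading.
            self._in_heading = False

    def handle_endtag(self, tag):
        if tag == self.heading:
            self._in_heading = False
        # All other closing tags are ignored, so bad nesting is harmless.

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.sections:
            return  # whitespace, or text before the first heading
        if self._in_heading:
            self.sections[-1][0] += text
        else:
            self.sections[-1][1].append(text)


def parse_sections(html, heading="h3"):
    parser = FlatSectionParser(heading)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return [(title, body) for title, body in parser.sections]


result = parse_sections(SAMPLE_HTML)
for title, body in result:
    print(title, body)
```

The trade-off is that you track state yourself instead of querying a tree, which only pays off when the tree cannot be trusted.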

Dan Diplo 2009-08-09 19:59:23

Thanks, but HTML Tidy by itself doesn't help: the order of opening and closing tags in the code I need to parse is so messed up that it comes out nested in a different way every time, and that inconsistent nesting is what ends up in a DOM parser.

RommeDeSerieux 2009-10-19 21:03:10

Tools for data mining hand-written HTML