I'm trying to extract some data from various HTML pages using a Python program. Unfortunately, some of these pages contain user-entered data that occasionally has "slight" errors, namely mismatched tags.

Is there a good way to have Python's xml.dom try to correct errors, or something of the sort? Alternatively, is there a better way to extract data from HTML pages that may contain errors?

+3  A: 

You could use HTML Tidy to clean the markup up, or Beautiful Soup to parse it. You may have to save the result to a temporary file, but it should work.
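A minimal sketch of the Beautiful Soup approach, using the modern bs4 package (the sample HTML and tag names are made up for illustration):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Malformed HTML: the <b> tag is never closed.
html = "<html><body><p>Hello <b>world</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
# The parser closes the dangling <b> for us, so extraction still works.
print(soup.find("b").get_text())
```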

Cheers,

Boldewyn
Beautiful Soup is not that great.
Geo
I guess it depends on what you want it to do.
Boldewyn
A: 

I used to use BeautifulSoup for such tasks, but I have since shifted to html5lib (http://code.google.com/p/html5lib/), which works well in many cases where BeautifulSoup fails.
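As a sketch, html5lib repairs tag soup the same way a browser would, following the HTML5 parsing algorithm (the broken input below is invented for illustration):

```python
import html5lib  # pip install html5lib

# Two unclosed <p> tags and an unclosed <b>.
broken = "<p>First<p>Second <b>bold text"

# parse() returns an xml.etree element tree; namespaceHTMLElements=False
# keeps the tag names plain ("p" rather than "{http://www.w3.org/1999/xhtml}p").
doc = html5lib.parse(broken, namespaceHTMLElements=False)

body = doc.find("body")
paragraphs = body.findall("p")
print(len(paragraphs))  # both <p> elements are recovered
```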

Another alternative is "Element Soup" (http://effbot.org/zone/element-soup.htm), which is a wrapper for Beautiful Soup using the ElementTree API.

Anurag Uniyal
A: 

lxml does a decent job at parsing invalid HTML.

According to their documentation, Beautiful Soup and html5lib sometimes perform better, depending on the input. With lxml you can choose which parser to use, and access them all through a unified API.
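A minimal sketch of lxml's recovery behaviour; the broken markup is invented, and only the default HTML parser is shown (the wrappers for the other parsers live in lxml.html.soupparser and lxml.html.html5parser):

```python
from lxml import html  # pip install lxml

# Mismatched tags: the first <p> is never closed and </i> has no opening <i>.
broken = "<html><body><p>one<p>two <b>bold</i></body></html>"

# lxml's HTML parser recovers from these errors by default.
tree = html.fromstring(broken)
paras = tree.findall(".//p")
print([p.text_content() for p in paras])
```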

Luper Rouch
A: 

If Jython is acceptable to you, TagSoup is very good at parsing junk, and I found the JDOM libraries far easier to use than other XML alternatives.

This is a snippet from a demo mockup for screen scraping TfL's journey planner:

 private Document getRoutePage(HashMap<String, String> params) throws Exception {
     String uri = "http://journeyplanner.tfl.gov.uk/bcl/XSLT_TRIP_REQUEST2";
     HttpWrapper hw = new HttpWrapper();
     String page = hw.urlEncPost(uri, params);
     // Use TagSoup as the SAX driver so JDOM can build a tree from malformed HTML
     SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
     Reader pageReader = new StringReader(page);
     return builder.build(pageReader);
 }