ansaurus

Question

Answer 1

+1 A:

Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

Evidently you're trying to parse an XHTML document using an external-entity-fetching parser. It's dragging in the DTD external subset so it can read any declarations for HTML-specific entities like &nbsp; or &eacute;.

You're getting an HTTP 503 from the w3.org server hosting that DTD external subset at the moment, but even if you weren't it'd still be highly impolite to bombard that server with requests for the DTD every time you do a scrape. (Maybe they're blocking you, for that very reason?)

You could create an EntityResolver to return your own local copy of the DTD, or a pared-down version that only includes the entity definitions. Or you can ask the reader not to fetch the DTD at all, by using setFeature to turn that option off, if the XMLReader implementation you have supports that feature (e.g. Xerces does). Though then you might get in trouble if the document contains non-builtin entity references like &nbsp;.
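A rough Java sketch of the EntityResolver route, in case it helps. The class name, the MINI_DTD string and the two entities declared in it are illustrative assumptions, not anything from the original question; a real page may well need more of the XHTML entity set.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: serves a pared-down DTD instead of hitting w3.org.
final class LocalDtdExample {
    // Assumed stand-in for the XHTML DTD: only the entity declarations we care about.
    static final String MINI_DTD =
        "<!ENTITY nbsp \"&#160;\">\n" +
        "<!ENTITY eacute \"&#233;\">\n";

    // Parses the document and returns its concatenated character data.
    static String parseText(String xhtml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // Intercept the request for the external DTD subset and answer it locally.
        reader.setEntityResolver((publicId, systemId) -> {
            if (systemId != null && systemId.endsWith("xhtml1-transitional.dtd")) {
                return new InputSource(new StringReader(MINI_DTD));
            }
            return null; // anything else: let the parser resolve it the default way
        });

        StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });

        reader.parse(new InputSource(new StringReader(xhtml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String doc =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html><body><p>caf&eacute;&nbsp;ok</p></body></html>";
        System.out.println(parseText(doc));
    }
}
```

Because the resolver answers the DTD request itself, the entities still expand and nothing is fetched from w3.org.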

Also, since this is a live web page being served as text/html, and especially because it comes from Microsoft, it's probably quite optimistic to assume it will remain well-formed! Screen scraping is usually best done with a parser that's tolerant of HTML quirks. But as the comments above state, using an API is a much better bet than screen-scraping in any case.

bobince 2010-05-05 18:37:53

Answer 2

A:

You are correct, Bob, and I thank you so much. My solution: use the setFeature method on the reader.

Just add this line: reader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false); and that's it.
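For anyone who wants to see that line in context, here is a minimal sketch. It assumes the reader comes from the JDK's built-in (Xerces-derived) SAX parser; the class and method names are made up for illustration. The feature URI is Xerces-specific, so another XMLReader implementation may throw SAXNotRecognizedException from setFeature.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: parses XHTML without loading the external DTD.
final class NoExternalDtdExample {
    // Xerces-specific feature; only honoured when the parser is non-validating.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Parses the document and returns its concatenated character data.
    static String parseText(String xhtml) throws Exception {
        // SAXParserFactory produces a non-validating parser by default.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // Ask the reader not to fetch the DTD named in the DOCTYPE.
        reader.setFeature(LOAD_EXTERNAL_DTD, false);

        StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });

        reader.parse(new InputSource(new StringReader(xhtml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String doc =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html><body><p>hello</p></body></html>";
        System.out.println(parseText(doc));
    }
}
```

Note the sample document avoids references such as &nbsp;: with the DTD skipped, those entities are no longer declared, which is the caveat raised in the first answer.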

Nir 2010-05-09 22:55:51

Parsing XHTML results from Bing
