Hi,

I am using XPath (with Java) to extract information from some websites. My problem is that some of these websites are not well-formed, so I cannot process them. Is there any way to skip the well-formedness check, or alternatively to specify tags that shouldn't be checked for well-formedness?

Thanks, Rp

+5  A: 

Preprocess with Tidy.

Morendil
There's actually a Java port: http://sourceforge.net/projects/jtidy
BC
+1  A: 

You probably don't want to use an XML parser to parse HTML. You'd be better off using a library such as HtmlUnit or HtmlParser.

Marc Novakowski
+2  A: 

TagSoup is a SAX-compliant parser written in Java that can handle all kinds of broken HTML. Try using TagSoup as your XML parser and then process its output with XPath.
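
A minimal sketch of that pipeline, using only JDK classes: any SAX `XMLReader` is streamed into a DOM tree, and XPath is then evaluated on the result. The class name `XPathOverSax` and the method `evaluate` are hypothetical names chosen for illustration. The `main` method uses the JDK's own strict parser on well-formed input so that it runs without extra jars. For broken HTML, the assumption is that TagSoup is on the classpath and that you pass `new org.ccil.cowan.tagsoup.Parser()` as the reader instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class XPathOverSax {

    // Builds a DOM tree from the SAX events produced by the given reader,
    // then returns the string value of the first node matched by the XPath.
    public static String evaluate(XMLReader reader, String markup, String xpath)
            throws Exception {
        DOMResult dom = new DOMResult();
        SAXSource source =
                new SAXSource(reader, new InputSource(new StringReader(markup)));
        // An identity transform copies the SAX stream into the DOM unchanged.
        TransformerFactory.newInstance().newTransformer().transform(source, dom);
        return XPathFactory.newInstance().newXPath().evaluate(xpath, dom.getNode());
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK's strict XML parser. Assumption: with
        // TagSoup available, replace this with
        // new org.ccil.cowan.tagsoup.Parser() to accept malformed HTML.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        String page = "<html><body><p>first</p><p>second</p></body></html>";
        System.out.println(evaluate(reader, page, "//*[local-name()='p'][2]"));
    }
}
```

The XPath matches on `local-name()` rather than a bare element name because, as far as I know, TagSoup reports elements in the XHTML namespace by default, and a plain `//p` would then match nothing unless a namespace context is configured.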

potyl
+3  A: 

Check out NekoHTML (http://nekohtml.sourceforge.net/) for turning the HTML into a DOM object.

Rob Di Marco
A: 

Hello, how are you? Are you well?

lona