Hi,

I have some XML (HTML) data in Java that is not well-formed. I tried parsing it with the JAXP DOM parser, but it complains.

The question is: is there any way to use JAXP to parse such documents?

I have a file containing data such as:

<employee>
 <name value="ahmed" > <!-- note: this element is not closed, so it is not well-formed XML -->
</employee>
+1  A: 

Not really. JAXP wants well-formed markup. Have you considered the Cyberneko HTML Parser? We've been very successful with it at our shop.
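
To see the complaint for yourself, here is a minimal sketch using only the JDK's default JAXP DOM parser. The class and method names (WellFormedCheck, isWellFormed) are made up for illustration, and the self-closed variant of the sample is my assumption about what the data was meant to be.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether the default JAXP DOM parser accepts a string.
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations surface as a SAXException (SAXParseException).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The sample from the question: <name> is never closed.
        String broken = "<employee><name value=\"ahmed\"></employee>";
        // Assumed intent: the same data with <name> self-closed.
        String closed = "<employee><name value=\"ahmed\"/></employee>";
        System.out.println(isWellFormed(broken)); // false
        System.out.println(isWellFormed(closed)); // true
    }
}
```

The parser treats the mismatched end tag as a fatal error, so there is no partial DOM tree to recover from.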

EDIT: I see you want to parse XML too. Hmm. Cyberneko works well for HTML, but I don't know how it handles other markup. It has a tag balancer that closes some unclosed tags, but I don't know whether you can configure it to recognize tags that are not HTML.

Andy Gherna
Does it allow modifying the document?
Mohammed
It is a parser, so you will have to parse the document using the DOM HTML parser and then modify the resulting DOM tree. There are settings you can enable to help you get a good result tree; they are documented at http://nekohtml.sourceforge.net/settings.html
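
The modify step can be sketched with the standard org.w3c.dom API, which works the same on any DOM tree. This sketch does not use NekoHTML: it assumes a tree built by the JDK's own DocumentBuilder from well-formed input as a stand-in, and the names DomEditSketch and setNameValue are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: edits one attribute in a DOM tree and serializes the result.
final class DomEditSketch {

    static String setNameValue(String xml, String newValue)
            throws ParserConfigurationException, SAXException,
                   IOException, TransformerException {
        // Stand-in for the tree an HTML parser would hand back.
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        // Modify the tree in place; this sketch assumes a <name> element exists.
        Element name = (Element) doc.getElementsByTagName("name").item(0);
        name.setAttribute("value", newValue);

        // An identity transform writes the modified tree back out as text.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            setNameValue("<employee><name value=\"ahmed\"/></employee>", "sara"));
    }
}
```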
Andy Gherna
+4  A: 

You could try running your document through the JTidy API first. It can convert HTML into valid XHTML: http://jtidy.sourceforge.net/howto.html

Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parse(......)...
simonlord
+5  A: 

You could use TagSoup. I have used it with great success. It is completely compatible with the Java XML APIs, including SAX, DOM, XSLT, and StAX. For example, here is how I used it to apply XSLT transforms to particularly poor HTML:

public static void transform(InputStream style, InputStream data)
        throws SAXException, TransformerException {
    XMLReader reader =
        XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
    Source input = new SAXSource(reader, new InputSource(data));
    Source xsl = new StreamSource(style);
    Transformer transformer =
        TransformerFactory.newInstance().newTransformer(xsl);
    transformer.transform(input, new StreamResult(System.out));
}
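
If you want a DOM tree rather than XSLT output, the same SAXSource idea should work with an identity transform into a DOMResult. Here is a rough sketch. It uses the JDK's default SAX reader so that it runs without extra jars, and the names SaxToDomSketch, toDom and defaultReader are made up. With TagSoup on the classpath, the reader created in the snippet above could presumably be passed to toDom instead.

```java
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper: builds a DOM tree from whatever SAX XMLReader it is given.
final class SaxToDomSketch {

    static Document toDom(XMLReader reader, String markup) throws TransformerException {
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
        // With no node set, the transformer creates a new Document to hold the result.
        DOMResult result = new DOMResult();
        // newTransformer() with no stylesheet is an identity transform.
        TransformerFactory.newInstance().newTransformer().transform(source, result);
        return (Document) result.getNode();
    }

    // Stand-in reader for this sketch: the JDK's own namespace-aware SAX parser.
    static XMLReader defaultReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        Document doc = toDom(defaultReader(),
            "<employee><name value=\"ahmed\"/></employee>");
        System.out.println(doc.getDocumentElement().getTagName()); // employee
    }
}
```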
Steven Huwig