How to convert the HTML source of a web page into an org.w3c.dom.Document in Java?
views: 54
answers: 3
+1
A:
That's actually fairly difficult to do robustly, because arbitrary HTML pages are often malformed (the major browsers are fairly tolerant of this), while an XML parser rejects anything that is not well-formed. You may want to look into the Swing HTML parser, which I've never tried but looks like it may be the best option. You could also try something along the lines of the following and handle any parsing exceptions that come up (although I've only ever tried this for XML):
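import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
...
try {
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
    // InputStreamYouBuiltEarlierFromAnHTTPRequest is a java.io.InputStream over the response body
    Document doc = docBuilder.parse(InputStreamYouBuiltEarlierFromAnHTTPRequest);
} catch (ParserConfigurationException e) {
    // the factory could not create a parser with the requested configuration
    ...
} catch (SAXException e) {
    // the input is not well-formed XML, which is common for real-world HTML
    ...
} catch (IOException e) {
    // reading from the stream failed
    ...
}
...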
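For reference, a minimal sketch of the Swing parser route mentioned above might look like the following. It uses only JDK classes: ParserDelegator reports tags and text to a callback, and the callback copies them into an empty org.w3c.dom.Document. The class name SwingHtmlToDom and its parse method are hypothetical names chosen for illustration. The Swing parser targets an old HTML dialect, so treat this as a starting point under those assumptions rather than a robust solution.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Enumeration;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: builds a W3C DOM from the events of the Swing HTML parser.
class SwingHtmlToDom {

    static Document parse(Reader html) throws IOException, ParserConfigurationException {
        final Document doc =
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Node that newly reported children are appended to.
            private Node current = doc;

            // A DOM Document accepts a single root element and no text nodes,
            // so anything reported outside the root is attached to the root.
            private Node attachPoint() {
                Element root = doc.getDocumentElement();
                return (current == doc && root != null) ? root : current;
            }

            private Element create(HTML.Tag tag, MutableAttributeSet attrs) {
                Element el = doc.createElement(tag.toString());
                Enumeration<?> names = attrs.getAttributeNames();
                while (names.hasMoreElements()) {
                    Object name = names.nextElement();
                    // Copy only attributes the parser recognises as HTML attributes;
                    // parser-internal markers and unknown attributes are skipped.
                    if (name instanceof HTML.Attribute) {
                        el.setAttribute(name.toString(),
                                String.valueOf(attrs.getAttribute(name)));
                    }
                }
                attachPoint().appendChild(el);
                return el;
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                current = create(tag, attrs);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                create(tag, attrs); // empty element such as <br>, nothing to descend into
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (current.getParentNode() != null) {
                    current = current.getParentNode();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                Node parent = attachPoint();
                if (parent != doc) {
                    parent.appendChild(doc.createTextNode(new String(data)));
                }
            }
        };

        // true = ignore any charset declared in the page; the Reader is already decoded.
        new ParserDelegator().parse(html, callback, true);
        return doc;
    }
}
```

Calling SwingHtmlToDom.parse(new InputStreamReader(stream, "UTF-8")) on the response stream would then give a Document without requiring the page to be well-formed XML. Comments are dropped in this sketch.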
Seth
2010-02-19 17:10:26
+1
A:
I suggest http://about.validator.nu/htmlparser/, which implements the HTML5 parsing algorithm. Firefox is in the process of replacing its own HTML parser with this one.
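A minimal sketch of how that might look, assuming the parser's jar is on the classpath. HtmlDocumentBuilder is the library's DOM entry point; the wrapper class Html5ToDom and its parse method are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical wrapper around the validator.nu HTML parser.
class Html5ToDom {

    static Document parse(String html) throws SAXException, IOException {
        // HtmlDocumentBuilder extends javax.xml.parsers.DocumentBuilder,
        // so it is used the same way as the JAXP builder.
        HtmlDocumentBuilder builder = new HtmlDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(html)));
    }
}
```

Because the builder follows the HTML5 parsing algorithm, malformed markup such as unclosed tags is repaired rather than rejected with a SAXException.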
Ms2ger
2010-02-19 18:13:40
+1
A:
I have just been playing with JSoup, which is a fantastic Java HTML parser that works a little like jQuery. Really easy to use.
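To get from jsoup to an org.w3c.dom.Document, something along these lines should work, assuming a jsoup version that ships the org.jsoup.helper.W3CDom helper. The wrapper class JsoupToW3c and its method name are hypothetical, chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;

// Hypothetical wrapper: parse with jsoup, then convert to a W3C DOM.
class JsoupToW3c {

    static org.w3c.dom.Document toW3cDocument(String html) {
        // jsoup has its own Document type and tolerates malformed markup.
        org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
        // W3CDom copies the jsoup tree into an org.w3c.dom.Document.
        return new W3CDom().fromJsoup(jsoupDoc);
    }
}
```

If the W3C DOM is only needed for querying, jsoup's own select method with CSS selectors may be enough and the conversion step can be skipped.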
DisgruntledGoat
2010-02-21 23:58:11