I'm attempting to do some screen scraping, but the HTML being returned causes an error, I think because it has no header. The code is below:

public class xpath
{
  private Document doc = null;

  public xpath()
  {
    HttpClient httpclient = new DefaultHttpClient();
    HttpGet httpget = new HttpGet("http://blah.com/blah.php?param1=value1&param2=value2");

    ResponseHandler<String> responseHandler = new BasicResponseHandler();

    try
    {
      String responseBody = httpclient.execute(httpget, responseHandler);
      doc = parserXML(responseBody);

      visit(doc, 0);
    }
    catch(Exception error)
    {
      error.printStackTrace();
    }
  }

  public void visit(Node node, int level)
  {
    NodeList nl = node.getChildNodes();

    for(int i=0, cnt=nl.getLength(); i<cnt; i++)
    {
      System.out.println("["+nl.item(i)+"]");

      visit(nl.item(i), level+1);
    }
  }

  public Document parserXML(String file) throws SAXException, IOException, ParserConfigurationException
  {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
  }

  public static void main(String[] args)
  {
     new xpath();
  }
}

It's throwing the exception "java.net.MalformedURLException: no protocol:"

Is there a way of getting the parser to be a bit more forgiving?

Thanks

+1  A: 

The exception you mention has nothing to do with XML parsing, by the way. DocumentBuilder's parse(String uri) method treats the string you pass as a URI pointing at the document, not as the document's content. Your response body is not a valid URL, so resolving it fails with MalformedURLException before any parsing happens.
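To hand the parser the markup itself rather than a location, the string can be wrapped in an InputSource backed by a StringReader. The following is a minimal sketch of that idea, not a fix for the scraping problem as a whole: the class name StringXmlParser and the method name parseXml are made up for illustration, and the input is assumed to be well-formed XML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for illustration only.
class StringXmlParser {

    // Parses XML held in memory instead of treating the string as a URI.
    static Document parseXml(String xml)
            throws SAXException, IOException, ParserConfigurationException {
        InputSource source = new InputSource(new StringReader(xml));
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: well-formed markup, unlike most real pages.
        Document doc = parseXml("<html><body><p>hello</p></body></html>");
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "html"
    }
}
```

In the question's code this would correspond to changing the body of parserXML; the MalformedURLException goes away, but markup that is not well-formed will still be rejected by the parser.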


I don't think Java's default XML parsers can be made lenient. They are XML parsers and must reject any input that is not well-formed.
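The strictness is easy to see with a small experiment. The sketch below feeds two fragments to the default DOM parser: one with a self-closed br element, and one with the unclosed br that is common in HTML pages. The class name StrictParserDemo and the method isWellFormed are hypothetical, and the sample fragments are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, named for illustration only.
class StrictParserDemo {

    // Returns true if the default XML parser accepts the markup,
    // false if it rejects it with a SAXException.
    static boolean isWellFormed(String markup)
            throws IOException, ParserConfigurationException {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException notWellFormed) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<p>one<br/>two</p>")); // prints "true"
        System.out.println(isWellFormed("<p>one<br>two</p>"));  // prints "false"
    }
}
```

The second fragment fails because the br element is never closed, which is fine in HTML but a fatal error in XML. The parser may also print its own error message to standard error before the exception is caught.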

You likely want to swap your XML parsing code for an HTML parser like this one. There's a list of open source HTML parsers for Java here. You might be able to find one that exposes a nicer browser-like API as a bonus.

Brabster
HtmlUnit is also a nice one: http://htmlunit.sourceforge.net Another favorite is JTidy: http://jtidy.sourceforge.net/
BalusC
A: 

There are parsers that can read invalid HTML/XML. I've used HTMLTidy and it did the job.

pablochan
A: 

Just print the responseBody string and check whether it contains valid content.

Calm Storm