I'm using TagSoup to clean some HTML I'm scraping from the internet, and I'm getting the following error when parsing pages that contain comments:

The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM comment: Comment data cannot start with a hyphen.

I'm using JDOM 1.1, and here's the code that does the actual cleaning:

    // Use TagSoup as the underlying SAX parser.
    SAXBuilder builder = new org.jdom.input.SAXBuilder("org.ccil.cowan.tagsoup.Parser");
    // Don't check the doctype! At our usage rate, we'll get 503 responses
    // from the W3C.
    builder.setEntityResolver(dummyEntityResolver);
    Reader in = new StringReader(str);
    org.jdom.Document doc = builder.build(in);
    String cleanXmlDoc = new org.jdom.output.XMLOutputter().outputString(doc);

Any idea what's going wrong, or how to fix this? I need to be able to parse pages with long comment strings such as `<!--------- data ------------>`.

+1  A: 

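In SGML, and therefore in HTML 4, a comment begins with --, ends with --, and does not contain --. A comment declaration (<! ... >) contains zero or more comments. XML is stricter: it allows only a single comment between <!-- and -->, and that comment must not contain --.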

Your example string can be reformatted as:

    <!----
      ----
      - data
      ----
      ----
      ---->

As you can see, - data is not a valid comment, so the document is not valid HTML. In your specific case you can probably fix it by replacing every match of the regular expression /<!--.*?-->/ with the empty string before parsing, but be aware that this change might also break some valid documents.
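The regex workaround could be sketched in Java roughly as follows. This is only an illustration: the class and method names (`CommentScrubber`, `stripComments`) are hypothetical, and the pattern `<!--.*?-->` with `Pattern.DOTALL` is one reasonable reading of the suggestion, not a general HTML comment parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes "<!-- ... -->" spans from raw HTML
// before the text is handed to the parser.
final class CommentScrubber {
    // Lazy match from "<!--" to the first following "-->".
    // DOTALL lets '.' match line breaks, so multi-line comments are covered.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>before</p><!--------- data ------------><p>after</p>";
        // prints: <p>before</p><p>after</p>
        System.out.println(stripComments(page));
    }
}
```

In the question's code, the scrubbed string would be the one passed to `new StringReader(...)`. The pattern treats everything up to the first `-->` as a single comment, which matches what browsers tend to do rather than the strict SGML reading, so the two can disagree on unusual input.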

Mark Byers
Given this, is there a way for tagsoup to swallow comments entirely? I do not need the comment data, and it is preventing me from parsing pages correctly. Fixing the HTML page should be a part of the library, I think, and if I can make tagsoup do this, all the better.
Stefan Kendall
@Stefan Kendall: I'm not sure it's possible for a generic library to fix this kind of error in general. Consider for example: `<!------ foo ----><b>data</b><!---- bar ------>`. This is a valid comment declaration but it probably doesn't do what you expect. It consists of the following comments: `<empty>`, ` foo `, `><b>data</b><!`, ` bar `, `<empty>`. If you remove comments, you will also remove the data. In other cases where the comment declaration is invalid, it is not clear what should happen.
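To make the strict reading concrete, here is a minimal Java sketch that splits a single comment declaration into its comments. The class name `SgmlCommentDeclaration` and its `split` method are made up for this illustration, and the whitespace handling is simplified (whitespace is accepted only after a closing `--`).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the strict SGML reading of a comment declaration.
final class SgmlCommentDeclaration {
    /**
     * Splits the declaration at the start of the string ("<!" ... ">") into
     * its comments, reading left to right: "--" opens a comment and the
     * next "--" closes it. Anything after the first closing ">" is ignored.
     *
     * @throws IllegalArgumentException if the text is not a valid declaration
     */
    static List<String> split(String decl) {
        if (!decl.startsWith("<!")) {
            throw new IllegalArgumentException("must start with \"<!\"");
        }
        List<String> comments = new ArrayList<>();
        int pos = 2;
        while (pos < decl.length()) {
            if (decl.charAt(pos) == '>') {
                return comments; // end of the declaration
            }
            if (!decl.startsWith("--", pos)) {
                throw new IllegalArgumentException(
                        "expected \"--\" or \">\" at index " + pos);
            }
            int close = decl.indexOf("--", pos + 2);
            if (close < 0) {
                throw new IllegalArgumentException(
                        "unterminated comment at index " + pos);
            }
            comments.add(decl.substring(pos + 2, close));
            pos = close + 2;
            // Simplification: skip whitespace after a closing "--".
            while (pos < decl.length()
                    && Character.isWhitespace(decl.charAt(pos))) {
                pos++;
            }
        }
        throw new IllegalArgumentException("missing closing \">\"");
    }

    public static void main(String[] args) {
        String decl = "<!------ foo ----><b>data</b><!---- bar ------>";
        for (String comment : split(decl)) {
            System.out.println("[" + comment + "]");
        }
    }
}
```

For the declaration above, `split` returns the five comments listed: an empty one, ` foo `, `><b>data</b><!`, ` bar `, and another empty one. For the string from the question, `<!--------- data ------------>`, it throws, because after two empty comments it reaches `- data` where either `--` or `>` is required.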
Mark Byers
Ack, this is still a yucky situation. By manually scrubbing the data before sending it to tagsoup, I was able to get around the problem. The pages I'm scraping are pretty cookie-cutter, so I knew I could make the change without worrying about edge cases.
Stefan Kendall