I've got some HTML files that need to be parsed and cleaned, and they occasionally have content with special characters like <, >, ", etc. which have not been properly escaped.

I have tried running the files through jTidy, but the best I can get it to do is omit the content it sees as malformed HTML. Is there a different library that will escape the malformed fragments instead of omitting them? If not, any recommendations on which library would be easiest to modify?

Clarification:

Sample input: <p> blah blah <M+1> blah </p>

Desired output: <p> blah blah &lt;M+1&gt; blah </p>

A: 

You can also try TagSoup. TagSoup emits regular old SAX events so in the end you get what looks like a well-formed XML document.

I have had very good luck with TagSoup and I'm always surprised at how well it handles poorly constructed HTML files.

Adam Batkin
I'm trying TagSoup, but I'm struggling. It just keeps on truckin' and never triggers the error handler.
Tyler
I have begun to modify the TagSoup source, and it looks promising. Will post some code if I get it working.
Tyler
A: 

Ultimately I solved this by running a regular expression first and an unmodified TagSoup second.

Here is my regular expression code to escape unknown tags like <M+1>:

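// Imports needed at the top of the enclosing class file:
import java.util.Scanner;
import java.util.regex.MatchResult;
import javax.swing.text.html.HTML;

private static String escapeUnknownTags(String input) {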
    Scanner scan = new Scanner(input);

    StringBuilder builder = new StringBuilder();

    while (scan.hasNext()) {

        String s = scan.findWithinHorizon("[^<]*</?[^<>]*>?", 1000000);

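        if (s == null) {
            // No '<' is left in the input: escape and append the remainder whole.
            // (scan.next(".*") returns one whitespace-delimited token at a time
            // and drops the whitespace between tokens.)
            builder.append(escape(scan.findWithinHorizon("(?s).+", 0)));
        } else {
            processMatch(s, builder);
        }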

    }

    return builder.toString();
}

private static void processMatch(String s, StringBuilder builder) {

    if (!isKnown(s)) {
        String escaped = escape(s);

        builder.append(escaped);
    }
    else {
        builder.append(s);
    }

}

private static String escape(String s) {
    s = s.replaceAll("<", "&lt;");
    s = s.replaceAll(">", "&gt;");
    return s;
}

private static boolean isKnown(String s) {
    Scanner scan = new Scanner(s);
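    // Group 1 is the tag name; it stops at whitespace or '/' so that a
    // self-closing tag such as <br/> is looked up as "br" rather than "br/".
    if (scan.findWithinHorizon("[^<]*</?([^<>\\s/]*)[^<>]*>?", 10000) == null) {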

        return false;
    }

    MatchResult mr = scan.match();

    try {

        String tag = mr.group(1).toLowerCase();

        if (HTML.getTag(tag) != null) {
            return true;
        }
    }
    catch (Exception e) {
        // Should never happen
        e.printStackTrace();
    }

    return false;
}
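
A more compact sketch of the same idea, using java.util.regex.Matcher instead of Scanner, might look like the following. The class name TagEscaper and the TAG_LIKE pattern are assumptions made for illustration, not part of the code above; it still relies on javax.swing.text.html.HTML.getTag to decide which tag names are known.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.swing.text.html.HTML;

// Hypothetical helper class; the name and the pattern are illustrative assumptions.
class TagEscaper {
    // A '<', an optional '/', a candidate tag name (group 1), then everything up to
    // the next '>' (included) or the next '<' / end of input (not included).
    private static final Pattern TAG_LIKE =
            Pattern.compile("</?([^<>\\s/]*)[^<>]*>?");

    static String escapeUnknownTags(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG_LIKE.matcher(input);
        int last = 0;
        while (m.find()) {
            // Text between tag-like runs has no '<' but may hold a stray '>'.
            out.append(escape(input.substring(last, m.start())));
            String name = m.group(1).toLowerCase(Locale.ROOT);
            boolean known = HTML.getTag(name) != null;
            // Known tags pass through untouched; anything else becomes text.
            out.append(known ? m.group() : escape(m.group()));
            last = m.end();
        }
        out.append(escape(input.substring(last)));
        return out.toString();
    }

    private static String escape(String s) {
        return s.replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // prints: <p> blah blah &lt;M+1&gt; blah </p>
        System.out.println(escapeUnknownTags("<p> blah blah <M+1> blah </p>"));
    }
}
```

Note that this sketch would also escape comments and doctype declarations, since their names are not in the Swing tag table, so it is worth testing against real input before running TagSoup on the result.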
Tyler
A: 

I have a similar situation and I am thinking of using a similar strategy to yours.

What is the HTML object mentioned in your code above? Is it a list of all possible HTML tags?

Can you post the definition for this object please?

Chris
You are correct in your follow-up post: it was javax.swing.text.html.HTML. I think the code I posted still works unchanged, but you may want to do some testing on those regular expressions.
Tyler
A: 

HTML cleaner

Fakrudeen
A: 

OK, I suspect it is this:

javax.swing.text.html.HTML
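
A quick way to check what that class knows about is a small lookup like the sketch below. The class name KnownTagLookup and the helper isKnownTag are made up for illustration; HTML.getTag and HTML.getAllTags are the static methods on javax.swing.text.html.HTML. The name is lowercased before the lookup, as in the code posted above.

```java
import java.util.Locale;
import javax.swing.text.html.HTML;

// Hypothetical demo class; only HTML.getTag and HTML.getAllTags come from the JDK.
class KnownTagLookup {
    // True when Swing's HTML tag table has an entry for this name.
    static boolean isKnownTag(String name) {
        return HTML.getTag(name.toLowerCase(Locale.ROOT)) != null;
    }

    public static void main(String[] args) {
        System.out.println("p     -> " + isKnownTag("p"));
        System.out.println("TABLE -> " + isKnownTag("TABLE"));
        System.out.println("M+1   -> " + isKnownTag("M+1"));
        // Dump every tag name the class knows, to see what would pass through.
        for (HTML.Tag tag : HTML.getAllTags()) {
            System.out.print(tag + " ");
        }
        System.out.println();
    }
}
```

The Swing tag table appears to reflect older HTML versions, so newer elements may come back as unknown; printing the full list as above is an easy way to confirm before relying on it.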

Chris
Yes, you are correct.
Tyler