ansaurus

Question

Ideal Java library for cleaning HTML and escaping malformed fragments

Answer 1

+1 A:

You can also try TagSoup. TagSoup emits regular old SAX events so in the end you get what looks like a well-formed XML document.

I have had very good luck with TagSoup and I'm always surprised at how well it handles poorly constructed HTML files.
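To make "SAX events" concrete, here is a minimal sketch of a handler that records each callback. It uses the JDK's own parser on well-formed input so that it has no dependencies. TagSoup's org.ccil.cowan.tagsoup.Parser implements the same XMLReader interface, so it should be possible to pass it to collectEvents instead. The class and method names (SaxEventDemo, collectEvents, eventsFromJdkParser) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventDemo {

    /** Parses markup with the given reader and records each SAX callback. */
    static List<String> collectEvents(XMLReader reader, String markup) throws Exception {
        final List<String> events = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one text run in several calls.
                events.add("text:" + new String(ch, start, length));
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return events;
    }

    /** Convenience wrapper around the JDK parser (assumption: not TagSoup). */
    static List<String> eventsFromJdkParser(String markup) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return collectEvents(reader, markup);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eventsFromJdkParser("<p>Hello <b>world</b></p>"));
    }
}
```

The JDK parser rejects malformed input with an exception, whereas TagSoup is designed to repair it and keep emitting events, so the same handler sees a well-formed stream either way.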

Adam Batkin 2010-03-01 19:17:09

I'm trying TagSoup, but I'm struggling. It just keeps on truckin' and never triggers the error handler.

Tyler 2010-03-01 20:16:54

I have begun to modify the TagSoup source, and it looks promising. Will post some code if I get it working.

Tyler 2010-03-02 00:43:03

Answer 2

A:

Ultimately I solved this by running a regular expression first and an unmodified TagSoup second.

Here is my regular expression code to escape unknown tags like <M+1>:

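import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.swing.text.html.HTML;

// One chunk: any leading text, then a tag-like run that starts at '<'.
private static final Pattern CHUNK = Pattern.compile("[^<]*</?[^<>]*>?");

// Group 1 captures the tag name. It stops at whitespace or '/' so that
// self-closing tags such as <br/> are still recognized.
private static final Pattern TAG_NAME = Pattern.compile("[^<]*</?([^<>\\s/]*)[^<>]*>?");

private static String escapeUnknownTags(String input) {
    StringBuilder builder = new StringBuilder();
    Matcher matcher = CHUNK.matcher(input);
    int end = 0;

    // Consecutive matches are contiguous, so no text is skipped between them.
    while (matcher.find()) {
        processMatch(matcher.group(), builder);
        end = matcher.end();
    }

    // The remainder has no '<' but may still contain '>'; its whitespace is kept.
    builder.append(escape(input.substring(end)));

    return builder.toString();
}

// Appends the chunk unchanged if its tag is known, escaped otherwise.
private static void processMatch(String s, StringBuilder builder) {
    builder.append(isKnown(s) ? s : escape(s));
}

// Only angle brackets are escaped; '&' is left untouched.
private static String escape(String s) {
    return s.replace("<", "&lt;").replace(">", "&gt;");
}

// True if the chunk ends in a tag whose name Swing's HTML class recognizes.
private static boolean isKnown(String s) {
    Matcher matcher = TAG_NAME.matcher(s);
    if (!matcher.find()) {
        return false;
    }

    String tag = matcher.group(1).toLowerCase(Locale.ROOT);

    // HTML.getTag returns null for names it does not know, such as "m+1" or "".
    return HTML.getTag(tag) != null;
}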

Tyler 2010-03-03 22:39:18

Answer 3

A:

I have a similar situation and I am thinking of using a similar strategy to yours.

What is the HTML object mentioned in your code above? Is it a list of all possible HTML tags?

Can you post the definition for this object please?

Chris 2010-04-16 09:23:14

You are correct in your follow-up post, it was javax.swing.text.html.HTML. I think the code I posted is still working unchanged, but you may want to do some testing on those regular expressions.
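One way to do that testing is to look at how the chunk expression from the code above splits an input before any escaping happens. The following is only a sketch: the class name ChunkRegexCheck and the method chunks are made up for illustration, and it uses java.util.regex directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChunkRegexCheck {

    // Same chunk expression as in the posted code: leading text, then a tag-like run.
    static final Pattern CHUNK = Pattern.compile("[^<]*</?[^<>]*>?");

    /** Splits the input into the chunks the expression matches, plus any tail. */
    static List<String> chunks(String input) {
        List<String> out = new ArrayList<>();
        Matcher matcher = CHUNK.matcher(input);
        int end = 0;
        while (matcher.find()) {
            out.add(matcher.group());
            end = matcher.end();
        }
        // Text after the last '<'-run never matches, so collect it separately.
        if (end < input.length()) {
            out.add(input.substring(end));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String chunk : chunks("x <M+1> y <b>z</b> end")) {
            System.out.println("[" + chunk + "]");
        }
    }
}
```

For the input x <M+1> y <b>z</b> end this gives four chunks: "x <M+1>", " y <b>", "z</b>" and the tail " end". A lone "<", as in a < b, swallows everything up to the next angle bracket, so that whole string comes back as a single chunk.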

Tyler 2010-04-16 11:27:15

Answer 4

A:

HtmlCleaner

Fakrudeen 2010-04-16 10:11:43

Answer 5

A:

OK, I suspect it is this:

javax.swing.text.html.HTML
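For anyone else wondering how that class behaves, here is a small sketch of the lookup. HTML.getTag(String) returns an HTML.Tag constant for names it knows and null otherwise. The class and method names (KnownTagLookup, isKnownTag) are made up for illustration.

```java
import java.util.Locale;
import javax.swing.text.html.HTML;

final class KnownTagLookup {

    /** True if Swing's HTML class has a Tag constant for this name. */
    static boolean isKnownTag(String name) {
        // Lower-case first, as the code posted earlier in the thread does.
        return HTML.getTag(name.toLowerCase(Locale.ROOT)) != null;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"b", "TABLE", "m+1"}) {
            System.out.println(name + " -> " + isKnownTag(name));
        }
        // getAllTags() exposes the full set the class recognizes.
        System.out.println(HTML.getAllTags().length + " tags known");
    }
}
```

Bear in mind that the tag list comes from Swing's HTML support, so as far as I know it predates HTML5 and newer element names may be reported as unknown.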

Chris 2010-04-16 10:34:10

Yes, you are correct.

Tyler 2010-04-16 11:29:41
