ansaurus

Question

Java remove HTML from String without regular expressions

Answer 1

A:

If you can add external jars, you can try one of these two small libraries:

TagSoup, a SAX parser
Jericho HTML Parser, another small HTML parser

Both let you strip all the markup.

I have used Jericho many times. To strip tags, you define an extractor that behaves the way you want:

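// Package names as used by Jericho 3.x; older releases used a different package.
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.TextExtractor;

class HTMLStripExtractor extends TextExtractor
{
    public HTMLStripExtractor(Source src)
    {
        super(src);
        src.setLogger(null); // silence parser log output
    }

    // Return true to drop an element's content from the extracted text.
    // As written, this excludes every element except <a>.
    public boolean excludeElement(StartTag startTag)
    {
        return !HTMLElementName.A.equals(startTag.getName());
    }
}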

Jack 2010-03-21 23:10:41

Answer 2

+1 A:

I'd try to tackle this the other way around: create a DOM tree from the HTML, then extract the string from the tree:

Use a library like TagSoup to parse the HTML, cleaning it up into something close to XHTML.
As you stream the cleaned-up XHTML, extract the text you want.
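
A rough sketch of the "parse first, then collect the text" idea. To keep it runnable without extra jars, it swaps in the HTML parser that ships with the desktop JDK (javax.swing.text.html) instead of TagSoup, so it is an illustration of the approach rather than TagSoup code, and it will not be available on a BlackBerry. The class and method names are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: lets a real HTML parser do the tokenizing and
// keeps only the text events it reports.
final class ParserTextExtractor {
    private ParserTextExtractor() {}

    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of character data between tags.
                text.append(data);
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader should not fail, but the API declares IOException.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

With TagSoup the shape would be similar: register a SAX ContentHandler and append whatever arrives in characters().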

Jim Ferrans 2010-03-21 23:14:20

Answer 3

+3 A:

There are a lot of nuances to parsing HTML in the wild, one of the funnier ones being that many pages out there do not follow any standard. That said, if all your HTML is going to be as simple as your example, something like this is more than enough:

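    char[] cs = s.toCharArray();
    StringBuilder sb = new StringBuilder();
    boolean tag = false; // true while we are inside <...>
    for (int i = 0; i < cs.length; i++) {
        switch (cs[i]) {
            case '<':
                tag = true;
                break;
            case '>':
                // a stray '>' outside a tag is ordinary text
                if (tag) tag = false; else sb.append(cs[i]);
                break;
            case '&':
                // decode escapes only in text; the return value is the number of extra chars to skip
                if (!tag) i += interpretEscape(cs, i, sb);
                break;
            default:
                if (!tag) sb.append(cs[i]);
        }
    }
    System.err.println(sb);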

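Here interpretEscape() is a method you write yourself. It is supposed to know how to convert HTML escapes such as &gt; to their character counterparts, append the result to sb, and return how many characters to skip so that the loop resumes after the ending ;.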
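
For what it's worth, here is one way interpretEscape() could look. It is only a sketch under a few assumptions: it handles a handful of named escapes plus numeric ones (&#65; and &#x41;), it gives up if no ; shows up within a few characters, and it returns the number of extra characters to skip so that `i += interpretEscape(cs, i, sb)` leaves i on the closing ;. The class name HtmlEscapes and the length cap are made up for the example.

```java
// Hypothetical helper class; only the method name and argument order
// come from the loop that calls it.
final class HtmlEscapes {
    private HtmlEscapes() {}

    // Arbitrary cap on the escape body, so a stray '&' does not trigger a long scan.
    private static final int MAX_ESCAPE_LENGTH = 10;

    // cs[i] must be '&'. Appends the decoded text to sb and returns how many
    // characters after the '&' were consumed (0 if this is not a known escape).
    static int interpretEscape(char[] cs, int i, StringBuilder sb) {
        int end = -1;
        for (int j = i + 1; j < cs.length && j - i <= MAX_ESCAPE_LENGTH + 1; j++) {
            if (cs[j] == ';') { end = j; break; }
        }
        if (end > i + 1) {
            String decoded = decode(new String(cs, i + 1, end - i - 1));
            if (decoded != null) {
                sb.append(decoded);
                return end - i;
            }
        }
        sb.append('&'); // not an escape we recognise: keep the '&' as text
        return 0;
    }

    // Returns the replacement text for an escape body such as "gt" or "#65",
    // or null if it is not recognised.
    private static String decode(String name) {
        if (name.equals("amp")) return "&";
        if (name.equals("lt")) return "<";
        if (name.equals("gt")) return ">";
        if (name.equals("quot")) return "\"";
        if (name.equals("apos")) return "'";
        if (name.equals("nbsp")) return "\u00A0";
        if (name.charAt(0) != '#') return null;
        boolean hex = name.length() > 1 && (name.charAt(1) == 'x' || name.charAt(1) == 'X');
        try {
            int code = hex ? Integer.parseInt(name.substring(2), 16)
                           : Integer.parseInt(name.substring(1));
            return Character.isValidCodePoint(code) ? new String(Character.toChars(code)) : null;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

If you keep it in a separate class like this, the call in the loop becomes `i += HtmlEscapes.interpretEscape(cs, i, sb)`. Extend decode() with whatever other named escapes your HTML actually uses.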

tucuxi 2010-03-21 23:24:31

The HTML should always be pretty simple, as shown in my example. This works for me. Thanks very much!

behrk2 2010-03-21 23:40:29

Looks good. You'll probably need to alter it slightly to handle <![CDATA[ ... ]]> sections, though: the current code will skip their whole content.

Daniel 2010-03-21 23:45:33

Answer 4

+1 A:

I cannot use regular expressions because I am developing on the Blackberry platform

You cannot use regular expressions because HTML is a recursive language and regular expressions can't handle those.

You need a parser.

EJP 2010-03-22 09:25:37

Answer 5

A:

Hi. I can't find the interpretEscape method that is called in the code above. Which class includes this method?

calati 2010-05-21 11:55:45
