ansaurus

Question

Answer 1

+5 A:

SO is about to descend on you. But let me be the first to say, don't use regular expressions to parse HTML. Here is a list of Java HTML Parsers. Look around until you see an API that suits your fancy and use that instead.
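To give a feel for what the parser route looks like, here is a minimal sketch that uses the lenient HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`). That parser is my own pick for illustration because it needs no extra dependency; it is not necessarily one of the parsers on the list mentioned above, and third-party parsers such as TagSoup have different APIs. The class name `ParserTextExtractor` and the method `textChunks` are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text found between tags using the JDK's Swing HTML parser.
final class ParserTextExtractor {

    static List<String> textChunks(String html) throws IOException {
        List<String> chunks = new ArrayList<>();

        // The parser invokes handleText for each run of character data it encounters.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                String text = new String(data).trim();
                if (!text.isEmpty()) {
                    chunks.add(text);
                }
            }
        };

        // The third argument tells the parser to ignore any charset declared in the document.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello</p><div>World</div></body></html>";
        System.out.println(textChunks(html));
    }
}
```

The callback style means you never have to reason about angle brackets yourself; the parser decides what counts as a tag and what counts as text.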

danben 2010-01-18 18:49:41

TagSoup is particularly flavorful if you have sloppy HTML to worry about.

bmargulies 2010-01-18 18:52:34

Answer 2

+2 A:

Don't use regular expressions when parsing HTML.

Use XPath instead (if your HTML is well-formed). You can select text nodes very easily with the text() node test.
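A rough sketch of the XPath approach, using only the XML APIs that ship with the JDK. It assumes the input is well-formed XHTML (it will throw on sloppy HTML), and the class name `XPathTextNodes` and the method `textNodes` are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: returns the non-blank text nodes of a well-formed document.
final class XPathTextNodes {

    static List<String> textNodes(String xhtml) throws Exception {
        // Parse the markup as XML; this fails if the input is not well-formed.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // "//text()" selects every text node anywhere in the document.
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("//text()", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getNodeValue().trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello <b>world</b>!</p></body></html>";
        System.out.println(textNodes(xhtml)); // prints [Hello, world, !]
    }
}
```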

Welbog 2010-01-18 18:50:32

Answer 3

+3 A:

It looks like you are trying to use the | operator inside a negated character class, where it is neither needed nor treated as alternation: inside [...] a | is just a literal character. Just list the characters that you don't want to match:

Pattern pText = Pattern.compile(">([^<>]*?)<");
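For completeness, here is one way the pattern above might be driven with a Matcher. This is only a sketch for simple, predictable markup; the class name `TagTextRegex` and the method `textBetweenTags` are hypothetical, and trimming and skipping empty matches are my own additions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the pattern from the answer.
final class TagTextRegex {

    // Group 1 captures a run of characters containing neither '<' nor '>'.
    private static final Pattern P_TEXT = Pattern.compile(">([^<>]*?)<");

    static List<String> textBetweenTags(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = P_TEXT.matcher(html);
        while (m.find()) {
            // Adjacent tags such as "</a><b>" produce empty matches; skip those.
            String text = m.group(1).trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(textBetweenTags("<p>Hello <b>world</b>!</p>")); // prints [Hello, world, !]
    }
}
```

Note that this still treats a '>' or '<' inside an attribute value, comment, or script as a tag boundary, which is one reason the other answers recommend a parser.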

Guffa 2010-01-18 18:52:35


get text between html tags
