Possible duplicate: http://stackoverflow.com/questions/299942/regex-matching-html-tags-and-extracting-text

I need to extract the text between HTML tags such as <p></p>, or any other tag. My current pattern is:

Pattern pText = Pattern.compile(">([^>|^<]*?)<");

Does anyone know a better pattern? This one is not very useful. I need it to index the content of a web page.

Thanks

+5  A: 

SO is about to descend on you. But let me be the first to say, don't use regular expressions to parse HTML. Here is a list of Java HTML Parsers. Look around until you see an API that suits your fancy and use that instead.
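
As a rough illustration of the parser-based approach, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html.parser.ParserDelegator) rather than any of the third-party libraries mentioned here. The class name HtmlTextExtractor and the method extractText are hypothetical names chosen for this example, and the Swing parser is only one possible choice; a dedicated library will likely cope better with messy real-world pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text chunks reported by the JDK's HTML parser.
public class HtmlTextExtractor {

    public static List<String> extractText(String html) throws IOException {
        final List<String> chunks = new ArrayList<>();

        // The parser invokes handleText for each run of character data it finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                String text = new String(data).trim();
                if (!text.isEmpty()) {
                    chunks.add(text);
                }
            }
        };

        // The third argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return chunks;
    }
}
```

Calling HtmlTextExtractor.extractText(pageSource) returns the text pieces in document order, which can then be handed to whatever does the indexing.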

danben
TagSoup is particularly flavorful if you have sloppy HTML to worry about.
bmargulies
+2  A: 

Don't use regular expressions when parsing HTML.

Use XPath instead (if your HTML is well-formed). You can reference text nodes very easily using the text() function.
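
A minimal sketch of what that could look like with the standard javax.xml APIs, assuming the input is well-formed XML (for example XHTML without an external DTD reference). The class name XPathTextExtractor and the method extractText are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: selects every text node with the XPath expression //text().
public class XPathTextExtractor {

    public static List<String> extractText(String xhtml) throws Exception {
        // Parsing fails with an exception if the markup is not well-formed.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("//text()", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // Skip whitespace-only nodes that come from indentation between tags.
            String text = nodes.item(i).getNodeValue().trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }
}
```

A narrower expression such as //p/text() would restrict the result to text directly inside p elements.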

Welbog
+3  A: 

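It looks like you are trying to use the | operator inside a negated character class, which neither works nor is needed: inside the brackets, | and the second ^ are treated as literal characters, so your pattern also excludes those two characters. Just list the characters that you don't want to match: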

Pattern pText = Pattern.compile(">([^<>]*?)<");
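
For completeness, a small sketch of how the corrected pattern might be applied. The class name RegexTextExtractor and the method extractText are hypothetical, and this only behaves sensibly on simple markup: a stray < or > inside a comment, script, or attribute value will confuse it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: applies the corrected pattern and keeps non-empty matches.
public class RegexTextExtractor {

    // Captures any run of characters between '>' and '<' that contains neither.
    private static final Pattern P_TEXT = Pattern.compile(">([^<>]*?)<");

    public static List<String> extractText(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = P_TEXT.matcher(html);
        while (m.find()) {
            // Adjacent tags such as "</p><p>" yield an empty group, so skip those.
            String text = m.group(1).trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }
}
```

For the input <p>Hello</p><p>World</p>, the matches are "Hello", an empty string between the two adjacent tags (which is skipped), and "World".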
Guffa