ansaurus

Question

Regular expressions in Java

Answer 1

+1 A:

http://stackoverflow.com/questions/1732348#1732454

You have been warned.

jjnguy 2010-05-27 21:57:26

I'm sorry but I am new to this. Could you please tell me what the warning was? I might have not understood.

rookie 2010-05-27 22:02:56

In a less horror-blockbuster tone: he is warning you that regular expressions **should not** be used to parse (X)HTML.

nc3b 2010-05-27 22:04:19

@rookie Basically, the point is that regular expressions are not good for parsing HTML unless you have a very specific case. You should use an HTML parser instead.

jjnguy 2010-05-27 22:04:40

Yes, I have used the Jericho HtmlParser. But these are specific cases, and I can't seem to figure out a good enough regular expression to deal with them. The warning comment really left me stumped right there. :)

rookie 2010-05-27 22:07:55

Answer 2

+1 A:

The following expression:

\([^)]*?\)|<[a-zA-Z/][^>]*?>

will match anything that looks like an HTML tag and any parenthesized text. Replace said text with "", and there ya go.

Note: If you try to match any string that has script tags in it, or "HTML" where the author didn't bother to escape < and > when they weren't used as tag delimiters, or a ( without a ), things will probably not work as you'd hoped.
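For reference, here is a minimal sketch of how the expression above could be applied in Java. The class name, method name, and sample string are made up for illustration; the only thing taken from the answer is the pattern itself. Note that every backslash has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption, not part of the answer.
class StripMarkup {
    // The expression from the answer: parenthesized text OR anything tag-shaped.
    private static final Pattern MARKUP =
            Pattern.compile("\\([^)]*?\\)|<[a-zA-Z/][^>]*?>");

    // Replaces every match with the empty string.
    static String strip(String input) {
        return MARKUP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Tags and "(aside)" are removed; the spaces around them are kept.
        System.out.println(strip("<p>Hello (aside) <b>world</b></p>"));
        // A stray < and an unclosed ( are left alone, as the note warns.
        System.out.println(strip("a < b (unclosed"));
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which is what String.replaceAll would do.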

cHao 2010-05-27 22:09:13

Thank you very much for your help. I'm sorry for any inconvenience caused by the way I framed my question, and I thank you for understanding. I will make sure that I state my objectives better next time. If it's not too much of a bother, I can't seem to understand how this regular expression does the trick. Would it be possible for you to break it down? If not, that is okay too; I will try to figure it out. Thanks again for your help.

rookie 2010-05-27 22:29:06

It's actually two parts. The first is \([^)]*?\), which will match a (, any number of chars that aren't ) (as few as possible, though, hence the ?), and then a ). The second part is <[a-zA-Z/][^>]*?>, which will match an opening <, then a letter or a / (the / allows closing tags; requiring one of these is an attempt to avoid matching mistakenly unescaped <'s), and then everything up to the next >, the same way the () part works. The | between them means "or", so if either part matches, the expression matches.
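To see the two halves at work, here is a rough sketch that runs each one separately and lists what it finds. The class name, helper method, and sample string are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are assumptions made for illustration.
class RegexParts {
    // First half: a (, then non-) characters, then a ).
    static final Pattern PARENS = Pattern.compile("\\([^)]*?\\)");
    // Second half: a <, then a letter or /, then non-> characters, then a >.
    static final Pattern TAGS = Pattern.compile("<[a-zA-Z/][^>]*?>");

    // Collects every substring of input that the pattern matches, in order.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String sample = "<a href=\"x\">link</a> (note) 1 < 2";
        System.out.println(findAll(PARENS, sample)); // [(note)]
        System.out.println(findAll(TAGS, sample));   // [<a href="x">, </a>]
    }
}
```

The lone < in "1 < 2" is skipped because it is followed by a space rather than a letter or a /.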

cHao 2010-05-27 22:45:40

The ?'s can actually be taken out, now that I think about it. It'd never match past the first closing delimiter, since we're specifying that the closing delimiter can never be part of the inner string.
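A quick way to check this claim is to compare the two variants on the same input. The sketch below does that, and adds a third pattern using a greedy dot purely as a contrast; that third pattern is not from this thread, and the class name and sample string are likewise invented.

```java
import java.util.regex.Pattern;

// Hypothetical comparison class; names are assumptions made for illustration.
class LazyVsGreedy {
    // Original expression, with the lazy ?'s.
    static final Pattern LAZY =
            Pattern.compile("\\([^)]*?\\)|<[a-zA-Z/][^>]*?>");
    // Same expression with the ?'s removed.
    static final Pattern GREEDY =
            Pattern.compile("\\([^)]*\\)|<[a-zA-Z/][^>]*>");
    // Contrast only: a greedy dot CAN run past the first ).
    static final Pattern GREEDY_DOT = Pattern.compile("\\(.*\\)");

    static String strip(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "(one) keep (two) <i>this</i>";
        // Both print " keep  this": the negated classes stop at the first ) or >.
        System.out.println("[" + strip(LAZY, sample) + "]");
        System.out.println("[" + strip(GREEDY, sample) + "]");
        // Prints " <i>this</i>": the dot swallows everything up to the last ).
        System.out.println("[" + strip(GREEDY_DOT, sample) + "]");
    }
}
```

The negated character class is what keeps each match short, so laziness adds nothing here.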

cHao 2010-05-27 22:54:01

Thank you very much. That really helped a lot.

rookie 2010-05-27 22:59:37
