String s= "(See <a href=\"/wiki/Grass_fed_beef\" title=\"Grass fed beef\" " +
          "class=\"mw-redirect\">grass fed beef.) They have been used for " +
          "<a href=\"/wiki/Paper\" title=\"Paper\">paper-making since " +
          "2400 BC or before.";

In the string above, HTML is intermixed with text.

The requirement is that the output look like this:

They have been used for paper-making since 2400 BC or before.

Could someone help me with a generic regular expression that would produce the desired output from the given input?

Thanks in advance!

+1  A: 

http://stackoverflow.com/questions/1732348#1732454

You have been warned.

jjnguy
I'm sorry, but I am new to this. Could you please tell me what the warning was? I may not have understood it.
rookie
In a less horror-blockbuster tone: he is warning you that regular expressions **should not** be used to parse (X)HTML.
nc3b
@rookie Basically, the point is that regular expressions are not good for parsing HTML unless you have a very specific case. You should use an HTML parser instead.
jjnguy
Yes, I have used the Jericho HtmlParser. But these are specific cases, and I can't seem to figure out a good enough regular expression to deal with them. The warning comment really left me stumped right there. :)
rookie
+1  A: 

The following expression:

\([^)]*?\)|<[a-zA-Z/][^>]*?>

will match anything that looks like an HTML tag and any parenthesized text. Replace each match with "", and there ya go.
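As a minimal sketch of how that replacement could look in Java, using the string from the question: the class name StripMarkup and the method strip are made up for illustration, and the trim() call is an extra step that is not part of the expression itself (removing the leading parenthesized text leaves a space behind).

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
public class StripMarkup {
    // The expression from this answer: parenthesized text OR anything that looks like a tag.
    private static final Pattern MARKUP =
            Pattern.compile("\\([^)]*?\\)|<[a-zA-Z/][^>]*?>");

    static String strip(String input) {
        // Replace every match with the empty string.
        // trim() is an addition: deleting "(See ...)" leaves a leading space.
        return MARKUP.matcher(input).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String s = "(See <a href=\"/wiki/Grass_fed_beef\" title=\"Grass fed beef\" "
                 + "class=\"mw-redirect\">grass fed beef.) They have been used for "
                 + "<a href=\"/wiki/Paper\" title=\"Paper\">paper-making since "
                 + "2400 BC or before.";
        // Prints: They have been used for paper-making since 2400 BC or before.
        System.out.println(strip(s));
    }
}
```

Compiling the pattern once and reusing it is a design choice; s.replaceAll(regex, "") would behave the same for a one-off call.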

Note: If you try to match any string that has script tags in it, or "HTML" where the author didn't bother to escape < and > when they weren't used as tag delimiters, or a ( without a ), things will probably not work as you'd hoped.
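To make two of those caveats concrete, here is a small sketch with made-up inputs (the class name StripCaveats and both sample strings are hypothetical). It only illustrates how the expression behaves on such input; it is not a fix for it.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; inputs are invented to show the failure modes.
public class StripCaveats {
    private static final Pattern MARKUP =
            Pattern.compile("\\([^)]*?\\)|<[a-zA-Z/][^>]*?>");

    static String strip(String input) {
        return MARKUP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Unescaped < and > used as comparison signs: "<b then c>" looks
        // like a tag to the expression, so real text is deleted.
        System.out.println(strip("if a <b then c> d"));

        // A ( without a ): the parenthesis part never matches, so the
        // "(approx 5" text stays in the output; only the tags are removed.
        System.out.println(strip("total (approx 5 <b>items</b>"));
    }
}
```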

cHao
Thank you very much for your help. I'm sorry for any inconvenience caused by the way I framed my question, and I thank you for understanding. I will make sure that I state my objectives better next time. If it's not too much of a bother: I can't seem to understand how this regular expression does the trick. Would it be possible for you to break it down? If not, that is okay too; I will try to figure it out. Thanks again for your help.
rookie
It's actually two parts. The first is \([^)]*?\), which will match a (, any number of chars that aren't ) (as few as possible, though -- hence the ?), and then a ). The second part is <[a-zA-Z/][^>]*?>, which will match an opening <, then a letter or a / (the letter requirement tries to avoid matching mistakenly unescaped <'s, and the / allows closing tags), and then everything up to the next >, the same way the () part works. The | between them means "or", so if either part matches, the expression matches.
cHao
The ?'s can actually be taken out, now that I think about it. It'd never match past the first delimiter, since we're specifying that the delimiter can never be part of the inner string.
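A quick way to check that claim, sketched with a made-up sample string (the class name LazyVsGreedy is hypothetical): compile the expression with and without the ?'s and compare the results.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the lazy and greedy forms.
public class LazyVsGreedy {
    // Original form, with the lazy quantifiers.
    static final Pattern LAZY =
            Pattern.compile("\\([^)]*?\\)|<[a-zA-Z/][^>]*?>");
    // Same expression with the ?'s removed.
    static final Pattern GREEDY =
            Pattern.compile("\\([^)]*\\)|<[a-zA-Z/][^>]*>");

    static String strip(Pattern p, String input) {
        return p.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "(one) kept (two) <b>bold</b> text";
        // [^)] and [^>] cannot consume the closing delimiter, so the greedy
        // form stops at the same place as the lazy one. Prints: true
        System.out.println(strip(LAZY, sample).equals(strip(GREEDY, sample)));
    }
}
```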
cHao
Thank you very much. That really helped a lot.
rookie