I have this tag as input:

<a href="controller.jsp?sid=127490C88DB5&R=35144" class="11-link-dkred-bold"><b>Mr. John Q. Anderson&nbsp;&nbsp;&nbsp;MBA 1977 E</a>

From this I want to get the value:

Mr. John Q. Anderson   MBA 1977 E

What is the regex pattern for this?

+7  A: 

It is a Very Bad Idea™ to parse HTML using regular expressions, since HTML is not a regular language. You are better off running this through tidy (to clean it up) and then using an XML parser or XPath.
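To give an idea of what the parser route can look like without extra dependencies, here is a rough sketch that swaps in the lenient HTML parser bundled with the JDK (javax.swing.text.html) in place of tidy plus an XML parser. The class and method names (LinkTextParser, extractLinkText) are made up for illustration, and it assumes you want all the text found inside <a> elements.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text found inside <a> elements.
class LinkTextParser {

    static String extractLinkText(String html) {
        StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int anchorDepth = 0; // > 0 while we are inside an <a> element

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    anchorDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A && anchorDepth > 0) {
                    anchorDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (anchorDepth > 0) {
                    text.append(data);
                }
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }

        // Entities such as &nbsp; arrive decoded; turn non-breaking spaces into plain ones.
        return text.toString().replace('\u00a0', ' ').trim();
    }

    public static void main(String[] args) {
        String html = "<a href=\"controller.jsp?sid=127490C88DB5&R=35144\" "
                + "class=\"11-link-dkred-bold\"><b>Mr. John Q. Anderson"
                + "&nbsp;&nbsp;&nbsp;MBA 1977 E</a>";
        System.out.println(extractLinkText(html));
    }
}
```

The point of going this way is that the missing </b> is the parser's problem to recover from, not yours.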

Otherwise, the matching pattern with captures is:

<.*?>\([^<]+\)</.*?>

EDIT

I just noticed that your HTML is not well-formed! You don't have a closing </b> tag. The regex I gave you will only work if you have exactly one tag wrapping your text. It won't work for your example. Assuming you will always have a <b>...</b> tag inside, you can do:

<.*?><b>\([^<]+\)</b></.*?>
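If you are on Java, a minimal sketch of applying that pattern with java.util.regex could look like the following. Note that in java.util.regex the capture group is written with plain parentheses, because \( there matches a literal parenthesis. The names LinkTextRegex and extract are hypothetical, and replacing &nbsp; with a space is my assumption about the output you want.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing the <b>...</b> pattern in Java syntax.
class LinkTextRegex {

    // Same pattern as above, with an unescaped capture group.
    private static final Pattern LINK_TEXT =
            Pattern.compile("<.*?><b>([^<]+)</b></.*?>");

    // Returns the captured text, or an empty Optional when the markup does not match.
    static Optional<String> extract(String html) {
        Matcher matcher = LINK_TEXT.matcher(html);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Assumption: &nbsp; entities should become ordinary spaces.
        return Optional.of(matcher.group(1).replace("&nbsp;", " "));
    }

    public static void main(String[] args) {
        String wellFormed = "<a href=\"controller.jsp?sid=127490C88DB5&R=35144\" "
                + "class=\"11-link-dkred-bold\"><b>Mr. John Q. Anderson"
                + "&nbsp;&nbsp;&nbsp;MBA 1977 E</b></a>";
        // Same markup without the closing </b>, as posted in the question.
        String asPosted = wellFormed.replace("</b>", "");

        System.out.println(extract(wellFormed).orElse("(no match)"));
        System.out.println(extract(asPosted).orElse("(no match)"));
    }
}
```

The first call finds the text because the closing </b> is present; the second finds nothing, which is exactly the fragility being warned about here.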

EDIT

I just felt I had to reiterate, even after providing a regular expression. My conscience is getting to me. Just don't do it! Every time you use a regular expression to parse HTML, a hundred cute and cuddly kittens and puppies and baby seals are mercilessly clubbed to death. Do you want that kind of blood on your hands?! DO YOU?!

Sorry. Just don't do it! :)

Vivin Paliath
+2  A: 

I suggest using NekoHTML or some alternative, see e.g. http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/

If you want to parse it yourself, use ANTLR or JavaCC or something similar. To do it right, you need a powerful grammar.

Chris Lercher