Hi, I am trying to build a regular expression to extract the text inside the HTML tag as shown below. However I have limited skills in regular expressions, and I'm having trouble building the string.

How can I extract the text from this tag:

<a href="javascript:ProcessQuery('report_drilldown',145817)">text</a>

That is just a sample of the HTML source of the page. Basically, I need a regex string to match the "text" inside of the tag. Can anyone assist me with this? Thank you. I hope my question wasn't phrased too horribly...

UPDATE: Just for clarification, report_drilldown is absolute (it never changes), but I don't really care whether the regex matches it literally or not. 145817 is a random six-digit number that is actually a database ID. "text" is just simple plain text, so it shouldn't be invalid HTML. Also, most people are saying that it's best not to use regex in this situation, so what would be best to use? Thanks so much!

+2  A: 
<a href="javascript:ProcessQuery\('report_drilldown',[0-9]+\)">([^<]*)</a>

This won't really solve the problem, but it may just barely scrape by. In particular, it's very brittle: the slightest change to the markup and it won't match. If report_drilldown isn't meant to be absolute, replace it with [^']*, and/or capture both it and the number if you need them.
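As a rough sketch of how the pattern above might be applied, here it is with Python's standard re module. Python is an assumption made for illustration (the asker is working in C#), and the helper name extract_link_text is made up. The second call shows the brittleness: one extra attribute and nothing matches.

```python
import re

# The pattern from the answer, unchanged: the parentheses of ProcessQuery(...)
# are escaped, and the only capture group is the link text.
LINK_RE = re.compile(
    r"""<a href="javascript:ProcessQuery\('report_drilldown',[0-9]+\)">([^<]*)</a>"""
)

def extract_link_text(html):
    """Return the text of every matching drill-down link (hypothetical helper)."""
    return LINK_RE.findall(html)

sample = """<a href="javascript:ProcessQuery('report_drilldown',145817)">text</a>"""
print(extract_link_text(sample))

# Same link with an extra attribute: the pattern no longer matches at all.
changed = """<a class="x" href="javascript:ProcessQuery('report_drilldown',145817)">text</a>"""
print(extract_link_text(changed))
```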

If you need something that actually parses HTML, it's a bit of a nightmare if you have to deal with tag soup. If you were using Python, I'd suggest BeautifulSoup, but I don't know of anything similar for C#. (Does anyone know of a similar tag soup parsing library for C#?)
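For comparison, a parser-based version might look something like the sketch below. It uses html.parser from the Python standard library rather than BeautifulSoup, so that it runs without installing anything; the class name DrilldownLinkText and the rule "href starts with javascript:ProcessQuery(" are assumptions chosen for this example, not something the thread specifies.

```python
from html.parser import HTMLParser

class DrilldownLinkText(HTMLParser):
    """Collect the text of <a> tags whose href calls ProcessQuery (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.texts = []       # one entry per matching link
        self._buffer = None   # text chunks collected while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith("javascript:ProcessQuery("):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.texts.append("".join(self._buffer))
            self._buffer = None

parser = DrilldownLinkText()
# Extra attribute and nested markup, both of which defeat the fixed regex.
parser.feed("""<p><a class="x" href="javascript:ProcessQuery('report_drilldown',145817)">some <b>text</b></a></p>""")
parser.close()
print(parser.texts)
```

Unlike the regex, this does not depend on attribute order or on the link text being free of nested tags.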

Roger Pate
Attributes in HTML aren't supposed to contain <. And it's a well-formedness constraint in XML.
Roger Pate
Yes, I'm sorry, stupid console fonts are mixing me up - it was supposed to be (). Thanks for your help!
Maxim Zaslavsky
Hah, I update my post, see your answer, and now rollback to the original.
Roger Pate
Sorry about that!!!! My bad - now I'm convinced that I need to find a better font for CMD. Thanks!
Maxim Zaslavsky
Lucida Console and Envy Code R (search Google for it) work well for me.
Roger Pate
Thank you thank you thank you! For all the things (including the font)! I'm gonna try and implement that regular expression and will see what happens - Thanks a bunch!
Maxim Zaslavsky
Guess what? It worked! Thank you so much!
Maxim Zaslavsky
A: 
<a href\=\"[^\x00]*?\">

should get you the opening tag.

<\/a>

will give you the closing tag. Just extract out what is in between. Untested though.

fung
Do you mean \x instead of /x? Why any character except null? Why are = and " escaped? Since you're not using / delimiters in sed-style, escaping / is a little strange too.
Roger Pate
+3  A: 

The answer is... DON'T!

Use a library, such as this one:

http://htmlagilitypack.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=272

the.jxc
Thanks! I will try to implement that library.
Maxim Zaslavsky
+1  A: 

I agree regex might not be the best way to parse this, but using a backreference it's easily done:

<(?<tag>\w*)(?:.*)>(?<text>.*)</\k<tag>>

Where tag and text are named capture groups.
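The pattern above uses .NET syntax for named groups. As a sketch only, the same idea in Python's re module (an assumed language for illustration) spells the groups as (?P<name>...) and the backreference as (?P=name); the name TAG_RE is made up.

```python
import re

# Python spelling of the pattern above: (?<name>...) becomes (?P<name>...)
# and the backreference \k<name> becomes (?P=name).
TAG_RE = re.compile(r"<(?P<tag>\w*)(?:.*)>(?P<text>.*)</(?P=tag)>")

sample = """<a href="javascript:ProcessQuery('report_drilldown',145817)">text</a>"""
m = TAG_RE.search(sample)
print(m.group("tag"), m.group("text"))
```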

hat-tip: expresso library

Si
Thanks! Will try it.
Maxim Zaslavsky
Even assuming well-formed input (if it's not, this style of parsing may fail or, worse, incorrectly succeed) you have two problems shown by this sample input: 1) <em><em>text</em>more text</em>. 2) <em>a</em><em>b</em>. Of course, your answer is really no better than mine, but I would be hesitant to call it easily done. Regex is simply the wrong tool for this job, even when it works occasionally.
Roger Pate
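The two sample inputs in the comment above can be tried directly. This is a sketch using a Python port of Si's pattern ((?P<name>...) in place of the .NET (?<name>...)), so it assumes Python's re module backtracks the same way as the .NET engine on this expression; SI_RE is a made-up name.

```python
import re

# Python port of the backreference pattern under discussion.
SI_RE = re.compile(r"<(?P<tag>\w*)(?:.*)>(?P<text>.*)</(?P=tag)>")

nested = "<em><em>text</em>more text</em>"
adjacent = "<em>a</em><em>b</em>"

for sample in (nested, adjacent):
    m = SI_RE.search(sample)
    # The whole input matches, yet the captured text is only part of the content.
    print(repr(sample), "->", repr(m.group("text")))
```

In both cases the match "succeeds" while the text group holds only the tail of the content ("more text" and "b"), which is the incorrect-success failure mode the comment describes.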
OK. I am going to continue searching for a very "safe" and "good" method to process such "tag soup", but for now, as R. Pate's regex is working, I'm going to continue using it until I find a better solution. Thanks so much everybody!!!
Maxim Zaslavsky