ansaurus

Question

Help with building Regex to extract text of HTML tag

Answer 1

+2 A:

<a href="javascript:ProcessQuery\('report_drilldown',[0-9]+\)">([^<]*)</a>

This won't really solve the problem, but it may just barely scrape by. In particular, it's very brittle: the slightest change to the markup and it won't match. If report_drilldown isn't meant to be a fixed value, replace it with [^']*, and/or capture both it and the number if you need them.
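
As a rough illustration of the pattern and its looser variant, here is a minimal sketch in Python rather than C# (the regex syntax used here is accepted unchanged by Python's re module). The sample markup and the function names link_texts and link_details are assumptions made up for this example, not taken from the question.

```python
import re

# The pattern from this answer, unchanged.
LINK_RE = re.compile(
    r"""<a href="javascript:ProcessQuery\('report_drilldown',[0-9]+\)">([^<]*)</a>"""
)

# Looser variant: any first argument, capturing it, the number, and the text.
LOOSE_RE = re.compile(
    r"""<a href="javascript:ProcessQuery\('([^']*)',([0-9]+)\)">([^<]*)</a>"""
)

# Hypothetical sample markup, invented for illustration only.
SAMPLE = (
    '<td><a href="javascript:ProcessQuery(\'report_drilldown\',42)">Widgets</a></td>'
    '<td><a href="javascript:ProcessQuery(\'report_summary\',7)">Totals</a></td>'
)


def link_texts(html):
    """Return the link text of every report_drilldown link."""
    return LINK_RE.findall(html)


def link_details(html):
    """Return (query name, number, link text) for every ProcessQuery link."""
    return LOOSE_RE.findall(html)


print(link_texts(SAMPLE))    # ['Widgets']
print(link_details(SAMPLE))  # [('report_drilldown', '42', 'Widgets'), ('report_summary', '7', 'Totals')]
```

Note that an extra attribute, different quoting, or a nested tag inside the link would make both patterns silently miss the link, which is the brittleness described above.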

If you need something that actually parses HTML, it's a bit of a nightmare if you have to deal with tag soup. If you were using Python, I'd suggest BeautifulSoup, but I don't know of anything similar for C#. (Does anyone know of a similar tag soup parsing library for C#?)
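
To give a feel for the parser-based approach, here is a minimal sketch using Python's standard-library html.parser module. This is a swapped-in stand-in, not BeautifulSoup and not a C# library; the class name LinkTextParser and the helper extract_links are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects (href, text) pairs for every <a> element it sees."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open
        self._parts = None   # text fragments; None when not inside <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._parts is not None:
            self.links.append((self._href, "".join(self._parts)))
            self._parts = None


def extract_links(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Unlike the regex, this should keep working when the link gains extra attributes or wraps part of its text in another tag, since the parser handles the markup and the code only reacts to tag events.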

Roger Pate 2009-06-30 01:49:22

Attributes in HTML aren't supposed to contain <. And it's a well-formedness constraint in XML.

Roger Pate 2009-06-30 02:04:56

Yes, I'm sorry, stupid console fonts are mixing me up - it was supposed to be (). Thanks for your help!

Maxim Zaslavsky 2009-06-30 02:05:30

Hah, I update my post, see your answer, and now roll back to the original.

Roger Pate 2009-06-30 02:09:24

Sorry about that! My bad - now I'm convinced that I need to find a better font for CMD. Thanks!

Maxim Zaslavsky 2009-06-30 02:10:48

Lucida Console and Envy Code R (search google for it) work well for me.

Roger Pate 2009-06-30 02:40:41

Thank you, thank you, thank you! For all the things (including the font)! I'm gonna try to implement that regular expression and will see what happens - thanks a bunch!

Maxim Zaslavsky 2009-06-30 03:54:21

Guess what? It worked! Thank you so much!

Maxim Zaslavsky 2009-06-30 05:20:17

Answer 2

A:

<a href\=\"[^\x00]*?\">

should get you the opening tag.

<\/a>

will give you the closing tag. Just extract out what is in between. Untested though.
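
A rough sketch of that idea in Python (not C#): find an opening tag, find the next closing tag, and take the text in between. It assumes a simplified opening-tag pattern, [^"]* for the href value, rather than the exact expression above, and the function name text_between is hypothetical.

```python
import re

# Simplified opening tag: an <a> whose only attribute is a double-quoted href.
OPEN_RE = re.compile(r'<a href="[^"]*">')
CLOSE_RE = re.compile(r'</a>')


def text_between(html):
    """Return the raw text between each opening <a href="..."> and the next </a>."""
    results = []
    pos = 0
    while True:
        open_m = OPEN_RE.search(html, pos)
        if open_m is None:
            break
        close_m = CLOSE_RE.search(html, open_m.end())
        if close_m is None:
            break  # unclosed tag: stop rather than guess
        results.append(html[open_m.end():close_m.start()])
        pos = close_m.end()
    return results


print(text_between('<p><a href="x.html">First</a> and <a href="y.html">Second</a></p>'))
# ['First', 'Second']
```

Any markup nested inside the link is returned verbatim, and links with additional attributes are skipped, so this is only suitable for very regular input.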

fung 2009-06-30 01:50:09

Do you mean \x instead of /x? Why any character except null? Why are = and " escaped? Since you're not using / delimiters in sed-style, escaping / is a little strange too.

Roger Pate 2009-06-30 01:53:56

Answer 3

+3 A:

The answer is... DON'T!

Use a library, such as this one:

http://htmlagilitypack.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=272

the.jxc 2009-06-30 02:12:56

Thanks! I will try to implement that library.

Maxim Zaslavsky 2009-06-30 04:50:06

Answer 4

+1 A:

I agree regex might not be the best way to parse this, but using a backreference it's easily done:

<(?<tag>\w*)(?:.*)>(?<text>.*)</\k<tag>>

Where tag and text are named capture groups.
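
For readers who want to try the backreference approach, here is a small sketch translated to Python's syntax, where the named group is written (?P<tag>...) and the backreference (?P=tag) instead of .NET's (?<tag>...) and \k<tag>. The function name first_element and the sample inputs are assumptions made up for this example.

```python
import re

# Same structure as the .NET pattern above, in Python's named-group syntax.
TAG_RE = re.compile(r"<(?P<tag>\w*)(?:.*)>(?P<text>.*)</(?P=tag)>")


def first_element(html):
    """Return (tag, text) for the first match, or None if nothing matches."""
    m = TAG_RE.search(html)
    return (m.group("tag"), m.group("text")) if m else None


print(first_element("<b>hello</b>"))          # ('b', 'hello')
print(first_element("<b>one</b><b>two</b>"))  # ('b', 'two')
```

The second call shows a limitation: because .* is greedy, the first element is swallowed by the attribute part of the pattern and only the last element's text is reported.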

Hat tip: the Expresso library

Si 2009-06-30 04:12:41

Thanks! Will try it.

Maxim Zaslavsky 2009-06-30 04:49:33

Even assuming well-formed input (if it's not, this style of parsing may fail or, worse, incorrectly succeed) you have two problems shown by this sample input: 1) textmore text. 2) ab. Of course, your answer is really no better than mine, but I would be hesitant to call it easily done. Regex is simply the wrong tool for this job, even when it works occasionally.

Roger Pate 2009-06-30 05:33:56

OK. I am going to continue searching for a very "safe" and "good" method to process such "tag soup", but for now, as R. Pate's regex is working, I'm going to continue using it until I find a better solution. Thanks so much, everybody!

Maxim Zaslavsky 2009-06-30 17:10:45
