Hi,

I need to extract some HTML from a webpage. The page contains comments, and I need the HTML that sits between two of those comments. I hope the example below helps. It needs to be done in C#.

<!--get html from here-->
<div><p>some text in a tag</p></div>
<!--get html from here-->

I want it to return

<div><p>some text in a tag</p></div>

How would I do this?

+1  A: 

What about finding the index of the first delimiter and the index of the second delimiter, then "cropping" the string in between? That sounds much simpler than a regex, and it may well be just as effective.

Manrico Corazzi
+2  A: 

Regexes are not ideal for HTML. If you really do want to process the HTML in all its glory, consider HtmlAgilityPack, as discussed in this question: http://stackoverflow.com/questions/100358/looking-for-c-html-parser/624410#624410

The Simplest Thing That Could Possibly Work is:

string pageBuffer=...;
string wrapping="<!--get html from here-->";
// Start just past the first marker.
int firstHitIndex=pageBuffer.IndexOf(wrapping) + wrapping.Length;
// Take everything up to, but not including, the second marker.
return pageBuffer.Substring(firstHitIndex, pageBuffer.IndexOf(wrapping, firstHitIndex) - firstHitIndex);

(with error checking that both markers are present)
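For illustration, here is a minimal sketch of the same approach with that error checking filled in. The class name `CommentExtractor`, the method name `ExtractBetween`, the choice to return null when a marker is missing, and the final `Trim()` (to drop the line breaks around the fragment) are all assumptions of this sketch, not part of the snippet above.

```csharp
using System;

public static class CommentExtractor
{
    // Returns the text between the first two occurrences of marker,
    // or null when fewer than two markers are present (an assumed convention).
    public static string ExtractBetween(string pageBuffer, string marker)
    {
        if (string.IsNullOrEmpty(pageBuffer) || string.IsNullOrEmpty(marker))
            return null;

        // Locate the first marker.
        int first = pageBuffer.IndexOf(marker, StringComparison.Ordinal);
        if (first < 0)
            return null;

        // Content starts just past the first marker.
        int start = first + marker.Length;

        // Locate the second marker, searching only after the first one.
        int second = pageBuffer.IndexOf(marker, start, StringComparison.Ordinal);
        if (second < 0)
            return null;

        // Trim is an assumption: it removes the newlines surrounding the fragment.
        return pageBuffer.Substring(start, second - start).Trim();
    }
}
```

Calling `CommentExtractor.ExtractBetween(page, "<!--get html from here-->")` on the example from the question should give back just the `<div>` fragment; a caller would need to handle the null case when either marker is absent.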

Depending on your context, WatiN might be useful (not if you're on a server, but if you're on the client side and doing something more interesting that could benefit from full HTML parsing).

Ruben Bartelink
+2  A: 

If all the instances are similarly formatted, an expression like this

<!--[^(-->)]*-->(.*)<!--[^(-->)]*-->

would retrieve everything between two comments. If your "get html from here" text in your comments is well defined, you could be more specific:

<!--get html from here-->(.*)<!--get html from here-->

When you run the RegEx over the string, the Groups collection would contain the HTML between the comments.

Ben Von Handorf
That's wrong. `[^(-->)]` is a character class that matches any **one** character except one of `( ) - >`. You're probably thinking of a lookahead: `(?:(?!-->).)*` - zero or more of any character, unless the next three characters are `-->`. It's a very common mistake.
Alan Moore
You should probably also use the lazy quantifier `*?` for your captured expression, since `*` is greedy and will happily eat a bunch of comments until it reaches the last one in the document.
Michael Petito
Good points, both.
Ben Von Handorf
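Putting the two corrections from the comments together, a regex version might look like the sketch below. The class and member names (`CommentRegexExtractor`, `FixedMarker`, `AnyComment`, `ExtractAll`) are made up for illustration. `RegexOptions.Singleline` is an addition not mentioned in the thread: it is assumed here so that `.` also matches line breaks, since the HTML in the question spans several lines.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CommentRegexExtractor
{
    // Specific marker text; the lazy (.*?) stops at the nearest closing marker.
    public static readonly Regex FixedMarker = new Regex(
        @"<!--get html from here-->(.*?)<!--get html from here-->",
        RegexOptions.Singleline);

    // Any comment on either side; (?:(?!-->).)* consumes the comment body
    // one character at a time, stopping before the first "-->".
    public static readonly Regex AnyComment = new Regex(
        @"<!--(?:(?!-->).)*-->(.*?)<!--(?:(?!-->).)*-->",
        RegexOptions.Singleline);

    // Collects capture group 1 from every match, trimmed of surrounding whitespace.
    public static List<string> ExtractAll(Regex pattern, string page)
    {
        var results = new List<string>();
        foreach (Match m in pattern.Matches(page))
        {
            results.Add(m.Groups[1].Value.Trim());
        }
        return results;
    }
}
```

With the lazy quantifier, a page containing several marker pairs yields one entry per pair rather than a single match running from the first marker to the last. The earlier caveat still applies: this is only reasonable when the markers are well defined, and a real parser is the safer choice for arbitrary HTML.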