Hi All

I have a console application that downloads HTML documents over HTTP using WebRequest. The problem is extracting data from the HTML that is returned.

Below is a fragment of the html I am interested in:

<span class="header">Number of People:</span>
<span class="peopleCount">1001</span>  <!-- this is the line we are interested in! -->
<span class="footer">As of June 2009.</span>

Assume that the above HTML is contained in a string called "responseHtml". I would like to extract just the 'People Count' value (second line).

I've searched Stack Overflow and found some code that could work:

http://stackoverflow.com/questions/378415/c-how-do-i-extract-a-string-of-text-that-lies-between-two-brackets

But when I implement it, it doesn't work - I don't think it likes the way I have placed HTML tags into the regex:

        string responseHtml; // this is already filled with html code above ^^
        string insideBrackets = null;


        Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");

        Match match = regex.Match(responseHtml);
        if (match.Success)
        {
            insideBrackets = match.Groups["TextInsideBrackets"].Value;
            Console.WriteLine(insideBrackets);
        }

The above never matches. Is it something to do with the HTML span brackets? All I want is the text value between the tags for that specific span.

Thanks in advance!

+3  A: 

?<TextInsideBrackets> is incorrect: without the surrounding parentheses it is not a named group.

You need:

(?<TextInsideBrackets>...)
leppie
Thanks for the input!
Dal
+1  A: 

I assume you want to do a named capture.

You should use

Regex regex = new Regex("\\<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)\\</span>");

and not

Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");

Jass
Thanks for the help, I see I was missing the round brackets!
Dal
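For anyone landing here later, a minimal self-contained sketch of the corrected named capture, run against the fragment from the question. In the original pattern the bare ? only makes the preceding > optional, and <TextInsideBrackets> is then matched as literal text, which is why it never matched. The class and method names (PeopleCountDemo, ExtractPeopleCount) are made up for illustration, and the redundant backslashes before < are dropped since < needs no escaping in a .NET regex.

```csharp
using System;
using System.Text.RegularExpressions;

static class PeopleCountDemo
{
    // Hypothetical helper: returns the text inside the peopleCount span, or null if there is no match.
    public static string ExtractPeopleCount(string responseHtml)
    {
        // (?<TextInsideBrackets>...) is a named capturing group; the parentheses are required.
        Regex regex = new Regex("<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)</span>");

        Match match = regex.Match(responseHtml);
        return match.Success ? match.Groups["TextInsideBrackets"].Value : null;
    }

    static void Main()
    {
        // Sample input copied from the question; the real program would fill this from the web response.
        string responseHtml =
            "<span class=\"header\">Number of People:</span>\n" +
            "<span class=\"peopleCount\">1001</span>\n" +
            "<span class=\"footer\">As of June 2009.</span>";

        Console.WriteLine(ExtractPeopleCount(responseHtml)); // prints 1001
    }
}
```

Note that \w+ only matches letters, digits and underscores, so a value such as "1,001" would not match with this pattern.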
+3  A: 

Try this one:

Regex regex = new Regex("class=\\\"peopleCount\\\"\\>(?<data>[^\\<]*)",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);

It should be a tad faster, as you are basically saying that the data you are looking for starts after peopleCount"> and ends at the first <.

(I changed the group name to data)

Cheers, Florian

Florian Doyon
Fantastic!! it worked! Thank you!
Dal
Glad it worked out! Just a note: you can (and should) reuse the Regex instance if you want to do this several times, even across several threads. Regex instances are thread-safe, and this particular one gets special treatment because of the RegexOptions.Compiled argument. Cheers
Florian Doyon
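The reuse advice in the last comment could look roughly like the sketch below: the Regex is built once in a static readonly field and shared by every caller. The names PeopleCountParser and TryGetPeopleCount are hypothetical, and the pattern is the one from the answer with the redundant escapes removed.

```csharp
using System;
using System.Text.RegularExpressions;

static class PeopleCountParser
{
    // Constructed once and reused for every call; Regex matching methods are safe to call from multiple threads.
    private static readonly Regex PeopleCountRegex = new Regex(
        "class=\"peopleCount\">(?<data>[^<]*)",
        RegexOptions.CultureInvariant | RegexOptions.Compiled);

    // Hypothetical helper: returns the captured text, or null when the span is not present.
    public static string TryGetPeopleCount(string html)
    {
        Match match = PeopleCountRegex.Match(html);
        return match.Success ? match.Groups["data"].Value : null;
    }
}
```

Calling PeopleCountParser.TryGetPeopleCount(responseHtml) on the fragment from the question should give "1001". Because [^<]* also accepts an empty string, an empty span yields "" rather than null.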