views: 297
answers: 3

OK, I've got this code:

public static string ScreenScrape(string url)
{
    System.Net.WebRequest request = System.Net.WebRequest.Create(url);
    // set properties of the request
    using (System.Net.WebResponse response = request.GetResponse())
    {
        using (System.IO.StreamReader reader = new System.IO.StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}

Now I want to filter the text to get only the div elements with class="comment". Is there an option other than regular expressions, or is that the only way?

Thanks

+8  A: 

You need to use the HTML Agility Pack.

For example:

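// Requires the HtmlAgilityPack package, plus: using HtmlAgilityPack; using System.Linq;
var doc = new HtmlWeb().Load(url);
var comments = doc.DocumentNode.Descendants("div")
                  .Where(div => div.GetAttributeValue("class", "") == "comment");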

Note that this won't find <div class="OtherClass comment">, because it compares the whole attribute value; if you're looking for that, you can call IndexOf on the attribute value instead of using ==.
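
One caveat with IndexOf is that a plain substring search would also match a class such as "comments" or "comment-count". A minimal sketch of a stricter check, assuming you only want exact class tokens: the class name CssClassMatcher and the method HasClass are hypothetical helpers, not part of the HTML Agility Pack, and the comparison is case-sensitive by assumption.

```csharp
using System;
using System.Linq;

public static class CssClassMatcher
{
    // Hypothetical helper: returns true when className appears as one of the
    // whitespace-separated tokens in the raw value of a class attribute.
    public static bool HasClass(string classAttribute, string className)
    {
        if (string.IsNullOrWhiteSpace(classAttribute))
        {
            return false;
        }

        // Passing a null separator array makes Split break on any whitespace.
        return classAttribute
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Contains(className, StringComparer.Ordinal);
    }
}
```

With this in place, the filter could become .Where(div => CssClassMatcher.HasClass(div.GetAttributeValue("class", ""), "comment")).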

SLaks
I've used the Agility Pack a few times; it parses almost anything - completely awesome!
Kieron
Note that it won't automatically process entities when reading `InnerText`; you'll need to call `HtmlEntity.DeEntitize` manually.
SLaks
A: 

You shouldn't use regular expressions to parse HTML - they are the wrong tool for the job, as HTML is too complex for them.
You should use an HTML parser.
See also: Looking for C# HTML parser

Kobi
A: 

Your first port of call should be the HTML Agility Pack.

Regular expressions are the classic way to parse this kind of input in non-.NET languages.

Additionally, if you can normalize this to an XML variant (i.e. XHTML), you can use XPath to query and retrieve the required nodes.
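
As a rough illustration of the XPath route, here is a minimal sketch using System.Xml from the base class library. It assumes the markup is already well-formed XML with no DOCTYPE and no default namespace; real XHTML usually declares the http://www.w3.org/1999/xhtml namespace, in which case the query would need an XmlNamespaceManager and a prefixed element name. The class name XhtmlCommentQuery and the method SelectCommentDivs are hypothetical.

```csharp
using System.Xml;

public static class XhtmlCommentQuery
{
    // Hypothetical helper: loads well-formed markup and returns every div
    // whose class attribute is exactly "comment".
    public static XmlNodeList SelectCommentDivs(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the input is not well-formed

        // The XPath compares the whole attribute value, so a div with
        // class="OtherClass comment" is not selected by this query.
        return doc.SelectNodes("//div[@class='comment']");
    }
}
```

Each returned XmlNode exposes InnerText and InnerXml, so the comment text can be read by iterating over the result with foreach.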

What you do not want to do is implement your own parser.

Oded
**DO NOT PARSE HTML USING Regular Expressions.** **DO _NOT_ PARSE HTML USING Regular Expressions.** _DO **NOT** PARSE HTML USING Regular Expressions._ http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
SLaks