Hi,

I have HTML stored in a string variable in my C# .NET 2.0 code. Below is an example:

<div class="track">
    <img alt="" src="http://hits.guardian.co.uk/b/ss/guardiangu-feeds/1/H.20.3/30561?ns=guardian&amp;pageName=Hundreds+feared+dead+in+Haiti+quake%3AArticle%3A1336252&amp;ch=World+news&amp;c3=GU.co.uk&amp;c4=Haiti+%28News%29%2CDominican+Republic+%28News%29%2CCuba+%28News%29%2CBahamas+%28News%29%2CNatural+disasters+and+extreme+weather+%28News%29%2CEnvironment%2CWorld+news&amp;c6=Rory+Carroll%2CHaroon+Siddique&amp;c7=10-Jan-13&amp;c8=1336252&amp;c9=Article&amp;c10=News&amp;c11=World+news&amp;c13=&amp;c25=&amp;c30=content&amp;h2=GU%2FWorld+news%2FHaiti" width="1" height="1" />
</div>
<p class="standfirst">
    • Tens of thousands lose homes in 7.0 magnitude quake<br />
    • UN headquarters, schools and hospitals collapse
</p>
<p>
    René Préval, the president of Haiti, has described the devastation after last night's earthquake as "unimaginable" as governments and aid agencies around the world rushed into action.
</p>
<p>
    Préval described how he had been forced to step over dead bodies and heard the cries of those trapped under the rubble of the national parliament. "Parliament has collapsed. The tax office has collapsed. Schools have collapsed. Hospitals have collapsed," <a href="http://www.miamiherald.com/582/story/1422279.html" title="he told the Miami Herald">he told the Miami Herald</a>. "There are a lot of schools that have a lot of dead people in them." Préval said he thought thousands of people had died in the quake.
</p>

I only want to output the first two paragraphs as a substring of the original.

Can someone help?

A: 

Are you using JavaScript? You could split the string on the p tags (explode is the PHP name for this; JavaScript uses split and C# uses String.Split) to get the div plus the first paragraph in one element of the array, and each remaining paragraph in its own element.

Kris.Mitchell
A: 

You can write a few methods that load the HTML into a WebBrowser variable and then use the DOM to traverse the nodes and pull out whatever content your custom logic needs. Check out this tutorial.

Here is a snippet showing how to create the WebBrowser in the code-behind rather than the way the tutorial does it:

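using System.Windows.Forms;

// The WebBrowser control must be created on an STA thread
// (for example, the UI thread of a Windows Forms application).
WebBrowser _Browser = null;
string _Source = "Your HTML goes here";

_Browser = new WebBrowser();
_Browser.Navigate("about:blank");    // load an empty page so a Document exists
_Browser.Document.OpenNew(true);     // replace it with a fresh, empty document
_Browser.Document.Write(_Source);    // write the HTML so the DOM can be traversed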
Ioxp
+1  A: 

Have a look at the Html Agility Pack.

It exposes a very powerful API for parsing HTML which can be used to extract the data you want.
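
As a rough sketch of what that could look like for this question, the snippet below loads the string and collects the first few p elements with an XPath query. It assumes the HtmlAgilityPack assembly is referenced; the class name HapParagraphs and the method name GetFirstParagraphs are made up for illustration and are not part of the library.

```csharp
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper; only HtmlDocument, LoadHtml, DocumentNode,
// SelectNodes and OuterHtml come from the Html Agility Pack.
public static class HapParagraphs
{
    // Returns the markup of the first 'count' <p> elements, concatenated.
    public static string GetFirstParagraphs(string html, int count)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
        {
            return string.Empty;
        }

        StringBuilder result = new StringBuilder();
        for (int i = 0; i < paragraphs.Count && i < count; i++)
        {
            result.Append(paragraphs[i].OuterHtml);
        }
        return result.ToString();
    }
}
```

Calling HapParagraphs.GetFirstParagraphs(html, 2) on the example above should skip the tracking div and return the standfirst paragraph followed by the first body paragraph.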

AdamRalph
+1 Thanks for the link
Ioxp
A: 

I used this function in the end ...

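// Requires: using System.Text.RegularExpressions;
private string GetFirstParagraph(string htmltext)
{
    // Matches the first attribute-less <p>...</p> and captures its trimmed content.
    // Without RegexOptions.Singleline, '.' does not match newlines, so the
    // paragraph text itself must sit on a single line.
    Match m = Regex.Match(htmltext, @"<p>\s*(.+?)\s*</p>");
    if (m.Success)
    {
        return m.Groups[1].Value;
    }
    else
    {
        // No matching paragraph found: fall back to the whole input.
        return htmltext;
    }
}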
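
The function above returns the content of one paragraph only. For the original request (the first two paragraphs as a substring of the input), a variation along the following lines might work. It is only a sketch: the class name ParagraphExtractor and the method name GetFirstParagraphs are invented for illustration, and a regular expression is not a full HTML parser, so it assumes reasonably well-formed markup with non-nested p elements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original answer.
public static class ParagraphExtractor
{
    // Matches <p> or <p attr="...">, then lazily everything up to the next </p>.
    // Singleline lets '.' cross line breaks; IgnoreCase also accepts <P>.
    private static readonly Regex Paragraph = new Regex(
        @"<p(\s[^>]*)?>.*?</p>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the substring of 'html' that runs from the start of the first
    // paragraph to the end of the count-th paragraph (or the last one found).
    // Returns an empty string when there are no paragraphs or count <= 0.
    public static string GetFirstParagraphs(string html, int count)
    {
        Match m = Paragraph.Match(html);
        if (count <= 0 || !m.Success)
        {
            return string.Empty;
        }

        int start = m.Index;
        int end = m.Index + m.Length;
        for (int i = 1; i < count; i++)
        {
            m = m.NextMatch();
            if (!m.Success)
            {
                break;
            }
            end = m.Index + m.Length;
        }
        return html.Substring(start, end - start);
    }
}
```

With the example HTML from the question, ParagraphExtractor.GetFirstParagraphs(html, 2) would start at the standfirst paragraph, because the pattern also accepts p tags that carry attributes, and stop at the end of the paragraph that follows it.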