<table >
    <tr>
        <td colspan="2" style="height: 14px">
            tdtext1
            <a>hyperlinktext1<a/> 
        </td>
    </tr>
    <tr>
        <td>
            tdtext2
        </td>
        <td>
            <span>spantext1</span>
        </td>
    </tr>
</table>   

This is my sample text. How do I write a regular expression in C# that matches the inner text of the td, span, and hyperlink elements?

+6  A: 

I cringe every time I hear the words regex and HTML in the same sentence. I would suggest checking out the HtmlAgilityPack on CodePlex which is a very tolerant HTML parser that lets you use XPath queries against the parsed document. It's much cleaner and the person that inherits your code will thank you!

EDIT

As per the comments below, here are some examples of how to get the InnerText of those tags. It's very simple.

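using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("...your sample html...");

// Note: SelectNodes returns null (not an empty collection) when nothing
// matches, so each result is checked before iterating.

// all <td> tags in the document
HtmlNodeCollection tds = doc.DocumentNode.SelectNodes("//td");
if (tds != null) {
    foreach (HtmlNode td in tds) {
        Console.WriteLine(td.InnerText);
    }
}

// all <span> tags in the document
HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");
if (spans != null) {
    foreach (HtmlNode span in spans) {
        Console.WriteLine(span.InnerText);
    }
}

// all <a> tags in the document
HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a");
if (anchors != null) {
    foreach (HtmlNode a in anchors) {
        Console.WriteLine(a.InnerText);
    }
}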
Josh Einstein
+1, exactly what I was going to say!
Dean Harding
Can you help me with the XPath queries for the above parsing requirement?
mushtaqck
I added a code example. I don't know how complex your XPath requirements are but I guarantee you it'll be much easier with XPath than Regex.
Josh Einstein
Lucky you only cringe, unlike this guy, who totally lost it. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Igor Zevaka
@Igor, oh my god that is hilarious. I bet a hooker got murdered that night.
Josh Einstein
A: 

You could use something like:

        const string pattern = @"[a|span|td]>\s*?(?<text>\w+?)\s*?</\w+>";
        Regex regex = new Regex(pattern, RegexOptions.Singleline);
        MatchCollection m = regex.Matches(x);
        List<string> list = new List<string>();

        foreach (Match match in m)
        {
            list.Add(match.Groups["text"].Value);
        }
John
-1: you didn't try this, did you? You also didn't get the point that, in general, you cannot use regular expressions against HTML.
John Saunders
Yes, I did, and it works. Obviously YOU have not tried it. I disagree with not using regex for HTML. Everyone is entitled to their opinion. This method does the same as the above with two fewer loops. When parsing large HTML, that is a benefit. But hey, thanks for the negative just because you're a douch. Besides, the poster requested regex, not XPath, so that is what I gave.
John
Did you realize that HTML is not a regular language, so regular expressions don't work in all cases? And, BTW, the `[abc]` syntax means `a` or `b` or `c`, and so does `[a-c]`. You meant to use parentheses there: `(a|span|td)` is what you wanted.
John Saunders
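To illustrate the difference between `[...]` and `(...)` described in the comment above, here is a minimal sketch that runs both forms against a few candidate tag names. The class name `TagPatternDemo`, the helper `IsMatch`, and the `^...$` anchors are hypothetical additions for this demonstration only. They are not part of the answer's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatternDemo
{
    // Character class: matches exactly ONE character from the set
    // { a, |, s, p, n, t, d }. The pipes are literal characters here.
    public const string CharClass = @"^[a|span|td]$";

    // Alternation group: matches one of the whole names a, span, or td.
    public const string Alternation = @"^(a|span|td)$";

    // Hypothetical helper: true when the whole input matches the pattern.
    public static bool IsMatch(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        foreach (string name in new[] { "a", "span", "td", "p", "|" })
        {
            Console.WriteLine(
                "{0}: class={1}, group={2}",
                name,
                IsMatch(CharClass, name),
                IsMatch(Alternation, name));
        }
    }
}
```

With these anchored patterns, `span` and `td` fail the character class because they are longer than one character, while `p` and `|` pass it even though they are not among the intended tag names. The alternation group accepts only `a`, `span`, and `td`.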
The XPath example could easily be reduced to one loop using the same alternation expression. I believe you also misspelled douche.
Josh Einstein
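
As a rough sketch of the single-loop idea from the comment above, the three queries can be combined with the XPath union operator `|`. This assumes the HtmlAgilityPack NuGet package is referenced. The class name `InnerTextDemo`, the method `ExtractTexts`, and the shortened sample HTML are hypothetical and only serve the illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // external dependency: HtmlAgilityPack NuGet package

public static class InnerTextDemo
{
    // Hypothetical helper: collects the trimmed InnerText of every
    // td, span, and a element found by one XPath query.
    public static List<string> ExtractTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();

        // The XPath union operator | selects all three element types at once.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//td | //span | //a");

        // SelectNodes returns null when nothing matches.
        if (nodes == null)
        {
            return texts;
        }

        foreach (HtmlNode node in nodes)
        {
            texts.Add(node.InnerText.Trim());
        }
        return texts;
    }

    public static void Main()
    {
        // Shortened sample, assumed for this sketch only.
        const string html =
            "<table><tr><td>tdtext2</td><td><span>spantext1</span></td></tr></table>";

        foreach (string text in ExtractTexts(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

Note that `InnerText` on a `td` includes the text of its nested elements, so `spantext1` is reported both for the `span` and for the `td` that encloses it. If that duplication is unwanted, the query would need to be narrowed.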