ansaurus

Question

.NET regex inner text between td, span, a tag

Answer 1

+6 A:

I cringe every time I hear the words regex and HTML in the same sentence. I would suggest checking out the HtmlAgilityPack on CodePlex which is a very tolerant HTML parser that lets you use XPath queries against the parsed document. It's much cleaner and the person that inherits your code will thank you!

EDIT

As per the comments below, here are some examples of how to get the InnerText of those tags. Very simple.

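using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("...your sample html...");

// SelectNodes returns null, not an empty collection, when nothing matches,
// so fall back to an empty sequence to keep the loops from throwing.
var none = Enumerable.Empty<HtmlNode>();

// all <td> tags in the document
foreach (HtmlNode td in doc.DocumentNode.SelectNodes("//td") ?? none) {
    Console.WriteLine(td.InnerText);
}

// all <span> tags in the document
foreach (HtmlNode span in doc.DocumentNode.SelectNodes("//span") ?? none) {
    Console.WriteLine(span.InnerText);
}

// all <a> tags in the document
foreach (HtmlNode a in doc.DocumentNode.SelectNodes("//a") ?? none) {
    Console.WriteLine(a.InnerText);
}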

Josh Einstein 2010-05-20 06:38:09

+1, exactly what I was going to say!

Dean Harding 2010-05-20 06:41:09

Can you help me with the XPath queries for the above parsing requirement?

mushtaqck 2010-05-20 06:41:18

I added a code example. I don't know how complex your XPath requirements are but I guarantee you it'll be much easier with XPath than Regex.

Josh Einstein 2010-05-20 06:51:25

Lucky you only cringe, unlike this guy, who totally lost it. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Igor Zevaka 2010-05-20 06:53:04

@Igor, oh my god that is hilarious. I bet a hooker got murdered that night.

Josh Einstein 2010-05-20 06:55:25

Answer 2

A:

You could use something like:

        const string pattern = @"[a|span|td]>\s*?(?<text>\w+?)\s*?</\w+>";
        Regex regex = new Regex(pattern, RegexOptions.Singleline);
        MatchCollection m = regex.Matches(x);
        List<string> list = new List<string>();

        foreach (Match match in m)
        {
            list.Add(match.Groups["text"].Value);
        }

John 2010-05-21 03:06:44

-1: you didn't try this, did you? You also didn't get the point that, in general, you cannot use regular expressions against HTML.

John Saunders 2010-05-21 03:10:02

Yes I did, and it works. Obviously YOU have not tried it. I disagree with not using regex for HTML. Everyone is entitled to their opinion. This method does the same as above with two fewer loops. When parsing large HTML that is a benefit. But, hey, thanks for the negative just because you're a douch. Besides, the poster requested regex not XPath, so that is what I gave.

John 2010-05-21 03:12:43

Did you realize that HTML is not a regular language, so regular expressions don't work in all cases? And, BTW, the `[abc]` syntax means `a` or `b` or `c`, and so does `[a-c]`. You meant to use parentheses there: `(a|span|td)` is what you wanted.

John Saunders 2010-05-21 07:31:50
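
To make the character class point concrete, here is a small sketch using only System.Text.RegularExpressions. The class name and the sample strings are made up for illustration; the pattern at the end is the one from the answer, unchanged. It suggests why the pattern can appear to work on simple input (the class happens to match the last letter of `td`, `span` and `a` right before `>`) while still matching tags nobody asked for and missing multi-word text.

```csharp
using System;
using System.Text.RegularExpressions;

class CharClassVsAlternation
{
    static void Main()
    {
        // A character class matches exactly one character from its set,
        // and '|' is just another member of that set.
        Console.WriteLine(Regex.IsMatch("|", "^[a|span|td]$"));     // True
        Console.WriteLine(Regex.IsMatch("span", "^[a|span|td]$"));  // False

        // A group with alternation matches one of the whole names.
        Console.WriteLine(Regex.IsMatch("span", "^(a|span|td)$"));  // True

        // Whitespace inside the group is literal unless
        // RegexOptions.IgnorePatternWhitespace is set.
        Console.WriteLine(Regex.IsMatch("span", "^(a | span | td)$"));  // False
        Console.WriteLine(Regex.IsMatch("span", "^(a | span | td)$",
            RegexOptions.IgnorePatternWhitespace));                      // True

        // The pattern from the answer, unchanged.
        const string pattern = @"[a|span|td]>\s*?(?<text>\w+?)\s*?</\w+>";

        // It also fires on <p>, because 'p' is in the class.
        Console.WriteLine(Regex.Match("<p>hello</p>", pattern).Groups["text"].Value);  // hello

        // It misses multi-word text, because \w+ cannot cross the space.
        Console.WriteLine(Regex.IsMatch("<td>two words</td>", pattern));  // False
    }
}
```

Swapping the brackets for a group fixes the tag-name part only; attributes, nested markup and multi-word text would still need handling, which is where a parser tends to be less work.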

The XPath example could easily be reduced to one loop using the same alternation expression. I believe you also misspelled douche.

Josh Einstein 2010-05-21 11:44:18
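
For reference, a minimal sketch of the one-loop version mentioned in the last comment, assuming the HtmlAgilityPack package is referenced. The sample markup and the class name are invented for illustration; the XPath union operator `|` is standard XPath 1.0, and the null guard is there because SelectNodes returns null when nothing matches.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class SingleLoopExample
{
    static void Main()
    {
        // Hypothetical sample markup, not taken from the question.
        const string html =
            "<table><tr><td>cell</td></tr></table>" +
            "<span>label</span><a href=\"#\">link</a>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // One query for all three tag names, using the XPath union operator.
        var nodes = doc.DocumentNode.SelectNodes("//td | //span | //a");

        // Fall back to an empty sequence if the document has none of these tags.
        foreach (HtmlNode node in nodes ?? Enumerable.Empty<HtmlNode>())
        {
            Console.WriteLine(node.Name + ": " + node.InnerText);
        }
    }
}
```

An equivalent query would be `//*[self::td or self::span or self::a]`; which one reads better is a matter of taste.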
