I have a text area in my ASP.NET web application that is used to enter HTML code. There is also a button control which, when clicked, should retrieve just the text placed between certain HTML tags in the text area.

For example:

1) The user types HTML code, including tags, and clicks the OK button.
2) In my code, the text from the text area is retrieved, and only the part between the <p></p> tags should be saved to a string object.

I can obviously get the text from the text area and assign it to a string object, but I can't work out how to get just the text within a certain HTML tag such as <p></p>. Could someone help me out, please?

A: 

You might want to look at the following: http://stackoverflow.com/questions/1349023/how-can-i-strip-html-from-text-in-net

PieterG
Again, I don't want to remove the HTML tags. I just want to get text WITHIN a '<p></p>' tag.
Romulus
+2  A: 

Try this... example taken from MSDN and amended slightly to show your situation:

using System;
using System.Text.RegularExpressions;

class Example 
{
   static void Main() 
   {
      string text = "start <p>I want to capture this</p> end";
      string pat = @"<p>((?:.|\r|\n)+?)</p>";

      // Instantiate the regular expression object.
      Regex r = new Regex(pat, RegexOptions.IgnoreCase);

      // Match the regular expression pattern against a text string.
      Match m = r.Match(text);
      int matchCount = 0;
      while (m.Success) 
      {
         Console.WriteLine("Match"+ (++matchCount));
         // Group 0 is the whole match; the pattern defines only one capture group (1).
         for (int i = 1; i < m.Groups.Count; i++) 
         {
            Group g = m.Groups[i];
            Console.WriteLine("Group"+i+"='" + g + "'");
            CaptureCollection cc = g.Captures;
            for (int j = 0; j < cc.Count; j++) 
            {
               Capture c = cc[j];
               System.Console.WriteLine("Capture"+j+"='" + c + "', Position="+c.Index);
            }
         }
         m = m.NextMatch();
      }
   }
}

You can see this in action at ideone.com.

If you want to include your <p> tags in the result, then just change where you put the brackets in the regular expression to this:

string pat = @"(<p>(?:.|\r|\n)+?</p>)";
BG100
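For the text-area case in the question, the same pattern could be wrapped in a small helper that returns the paragraph texts instead of printing them. This is only a sketch: the names ParagraphExtractor and GetParagraphTexts are made up for illustration and are not part of the original answer, and the pattern only matches a bare <p> tag, not one with attributes such as <p class="x">.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class ParagraphExtractor
{
    // Returns the text found between each <p> and </p> pair in the input.
    public static List<string> GetParagraphTexts(string html)
    {
        var results = new List<string>();

        // Same pattern as in the answer: group 1 is the text between the tags.
        foreach (Match m in Regex.Matches(html, @"<p>((?:.|\r|\n)+?)</p>", RegexOptions.IgnoreCase))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}

public class ParagraphExtractorDemo
{
    public static void Main()
    {
        string html = "start <p>I want to capture this</p> end";
        foreach (string text in ParagraphExtractor.GetParagraphTexts(html))
        {
            Console.WriteLine(text); // prints: I want to capture this
        }
    }
}
```

In a web form, the string passed in would presumably be the text area's Text property, read in the button's click handler.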
@BG100 it would help the original poster if you showed some sample output from this code. Then they might be able to see whether this is what they want.
Daniel Dyson
Yep.... added a link to the code running at ideone.com.
BG100
+1 for the ideone link.
Robaticus
Hi, thanks for that answer. It looks suitable for my situation, but I am not sure what the difference between 'group' and 'capture' is in your code?
Romulus
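In a regex, a group is a parenthesised part of the pattern, and a capture is one piece of text that the group matched. A group normally has a single capture per match; it only collects several captures when the group itself is repeated by a quantifier. In this scenario, if you had more than one set of <p> tags, you would get one match per set, each with one capture in group 1.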
BG100
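To make the group/capture distinction concrete, here is a minimal sketch. The class and method names are invented for illustration. It compares the answer's style of pattern, which yields one match per <p> pair, with a pattern where the group sits inside a repeated construct and so collects several captures within a single match.

```csharp
using System;
using System.Text.RegularExpressions;

public class GroupVersusCapture
{
    // One match per <p>...</p> pair; each match has one capture in group 1.
    public static int CountMatches(string text)
    {
        return Regex.Matches(text, @"<p>(.+?)</p>").Count;
    }

    // A single match in which group 1 is repeated by the outer "+",
    // so group 1 collects one capture per repetition.
    public static int CountCapturesInRepeatedGroup(string text)
    {
        Match m = Regex.Match(text, @"(?:<p>(.+?)</p>)+");
        return m.Groups[1].Captures.Count;
    }

    public static void Main()
    {
        string text = "<p>one</p><p>two</p>";
        Console.WriteLine(CountMatches(text));                 // 2
        Console.WriteLine(CountCapturesInRepeatedGroup(text)); // 2
    }
}
```

With the loop from the answer (Match plus NextMatch), the inner loop over g.Captures runs once per match, because the group is not repeated inside the pattern.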
Excellent. Looks like exactly the thing I need.
Romulus
No prob.... glad to help.
BG100
Hi, just wondering - when I have two sets of <p></p> tags within the same string, this doesn't seem to work? - http://ideone.com/sxwIG
Romulus
You are correct; there is a small change you need to make to the regex to fix this. Change (.+) to (.+?) so that it matches as few characters as possible instead of as many as possible. I've updated my answer and the ideone.com example.
BG100
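The difference between (.+) and (.+?) can be seen with a short sketch like the following. The class and method names are hypothetical, and the sample string is made up for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public class GreedyVersusLazy
{
    // Greedy: ".+" runs as far as it can, so it stops at the LAST </p>.
    public static string FirstGreedy(string text)
    {
        return Regex.Match(text, @"<p>(.+)</p>").Groups[1].Value;
    }

    // Lazy: ".+?" stops at the FIRST </p> it can reach.
    public static string FirstLazy(string text)
    {
        return Regex.Match(text, @"<p>(.+?)</p>").Groups[1].Value;
    }

    public static void Main()
    {
        string text = "<p>one</p> and <p>two</p>";
        Console.WriteLine(FirstGreedy(text)); // one</p> and <p>two
        Console.WriteLine(FirstLazy(text));   // one
    }
}
```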
Great stuff, thanks!
Romulus
Hi, sorry to be a pain but I have one more question - when I get the HTML output from the textbox in C# into a string, it has all sorts of ugly escape characters and as a result (I think) your code does not work on this. http://ideone.com/5dWVV Any idea what I'm doing wrong?
Romulus
It's because the "." character in the regex matches any character except a newline (\n), so a paragraph that spans several lines is never matched. Give me a few minutes to work it out and I'll come back to you...
BG100
Try this: http://ideone.com/gqquG. You need to change the regex to "<p>((?:.|\r|\n)+?)</p>"
BG100
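As an alternative to spelling out the line-break characters in the pattern, .NET also offers RegexOptions.Singleline, which makes "." match every character including \n. The sketch below uses that option in place of the (?:.|\r|\n) construct; it is not what the answer uses, and the class and method names are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public class MultiLineParagraph
{
    // Without Singleline, "." does not match \n, so a paragraph that
    // spans lines produces no match at all.
    public static bool MatchesWithoutSingleline(string html)
    {
        return Regex.IsMatch(html, @"<p>(.+?)</p>", RegexOptions.IgnoreCase);
    }

    // With Singleline, "." also matches \n, so the plain (.+?) is enough.
    public static string FirstParagraph(string html)
    {
        Match m = Regex.Match(html, @"<p>(.+?)</p>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string html = "<p>line one\r\nline two</p>";
        Console.WriteLine(MatchesWithoutSingleline(html)); // False
        Console.WriteLine(FirstParagraph(html) == "line one\r\nline two"); // True
    }
}
```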