I want to get all HTML <p>...</p> elements in a document.
I am using a Regex to find them:

Regex regex = new Regex(@"\<p\>([^\>]*)\</p\>", RegexOptions.IgnoreCase);

But I am not getting any results. Is there anything wrong with my regular expression?

For now, I just want to get everything between the <p>...</p> tags, and I want to use a Regex for this because the source is not an HTML document.

+1  A: 

Using a regex for this is not the best idea. I suggest reading this thread:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

jonnii
Beat me by 29 seconds.
SLaks
+12  A: 

DO NOT PARSE HTML USING Regular Expressions!!!


Instead, use the HTML Agility Pack.

For example:

var doc = new HtmlDocument();
doc.Load(...);

var pTags = doc.DocumentNode.Descendants("p");

EDIT: You can do this even if the document isn't actually HTML.

SLaks
I love how the linked answer has become something of a meme.
Adam Robinson
My eleven hundredth answer!
SLaks
Congratulations :)
Amarghosh
It never ceases to amaze me how constant this topic is. And how easily 90+ rep can be earned… :-)
Tomalak
I actually only got 55 rep from this answer before hitting the cap.
SLaks
+1  A: 

The approach of using a regex to match HTML elements is destined to fail. A regular expression is not capable of reliably matching an HTML element. It's possible to build a more complex HTML element than your regex can match.

For example, I could beat your regex with the following:

<p>hello<p>again</p></p>
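As a minimal sketch of what happens with this input, the snippet below runs the pattern from the question against the nested string. It assumes C# 9 or later (top-level statements); the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question, unchanged.
var regex = new Regex(@"\<p\>([^\>]*)\</p\>", RegexOptions.IgnoreCase);

// The nested input from this answer.
string nested = "<p>hello<p>again</p></p>";

// The outer <p> never matches because [^\>]* cannot cross the inner tag's '>',
// so the only match is the inner paragraph.
MatchCollection matches = regex.Matches(nested);
foreach (Match m in matches)
{
    Console.WriteLine($"matched {m.Value}, captured {m.Groups[1].Value}");
}
```

The regex silently reports only the inner paragraph, and nothing in the result tells you that an enclosing element was skipped.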

Instead of using a regex, you need to use an HTML (or potentially an XML) parser / DOM. This is the only way to reliably query an HTML file.

Detailed Explanation of why:

JaredPar
+1  A: 

While others have said that you shouldn't be doing this with regular expressions, the reason yours is failing is that there is more HTML between your <p> tags. Your capture group [^\>]* excludes >, so as soon as the content contains another tag, the Regex cannot match.
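A minimal sketch of that failure mode, assuming C# 9 or later (top-level statements). The two sample strings are made up for illustration and are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question, unchanged.
var regex = new Regex(@"\<p\>([^\>]*)\</p\>", RegexOptions.IgnoreCase);

// Hypothetical inputs: one paragraph with plain text, one with inner markup.
string plain = "<p>plain text</p>";
string withMarkup = "<p>some <b>bold</b> text</p>";

// [^\>]* stops at the first '>' inside the content, so the second input fails.
bool plainMatches = regex.IsMatch(plain);
bool markupMatches = regex.IsMatch(withMarkup);

Console.WriteLine(plainMatches);   // True
Console.WriteLine(markupMatches);  // False
```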

Austin Salonen
I just want to get everything that comes in between <p>...</p> tags. What will be the right regex for this?
inutan
@inutan -- There isn't one that will work 100% of the time. See JaredPar's post.
Austin Salonen
A: 

You asked for it, but really, don't do this using regexes unless you control 100% of the HTML production...

public static Regex regex = new Regex(
      "(?<open>\\<p(?<attr>[^>])*\\>)(?<content>.*)\\</p(?:\\s*)\\>",
    RegexOptions.Multiline
    | RegexOptions.CultureInvariant
    | RegexOptions.Compiled
    );

tested against

<p>hello world</p>
<p style="Foo"></p >
<p>who nests paragraphs <p>in 2010?</p> </p  >
<p /><p><a href="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454">TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ</a></p><p/>

will yield for the content group

"hello world"
""
"who nests paragraphs <p>in 2010?</p>"
"<p><a href="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454">TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ</a>"

So if you are sure there are no self-closing <p/> tags in the input, go for it.

Florian Doyon
+1  A: 
@"(?is)<p>(?>(?:(?!</?p>).)*)</p>"

(?:(?!</?p>).)* matches one character at a time, after doing a lookahead to make sure it isn't part of a <p> or </p> tag.

(?>...) is an atomic group; it prevents backtracking that we know would be pointless.

(?is) is an alternative mechanism for specifying match modifiers--in this case, IgnoreCase and Singleline (the latter in case there are linefeeds or carriage returns between the tags, which would be redundant, but you did say it's not really HTML).

By the way, < and > have no special meaning in regexes, so there's no need to escape them. In fact, in some flavors you can give them special meanings by escaping them: \< and \> mean "beginning of word" and "end of word" respectively. But in .NET regexes the backslashes are just clutter.
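A minimal usage sketch for the pattern above, assuming C# 9 or later (top-level statements). The input string is a made-up sample chosen to exercise mixed case, a line break inside a paragraph, and nesting.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The pattern from this answer, unchanged.
var regex = new Regex(@"(?is)<p>(?>(?:(?!</?p>).)*)</p>");

// Hypothetical input: upper-case tags, a newline inside a paragraph, and nesting.
string input = "<P>hello world</P>\n<p>line one\nline two</p>\n<p>outer<p>inner</p></p>";

// Collect the full text of every match.
string[] found = regex.Matches(input)
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToArray();

foreach (string s in found)
{
    Console.WriteLine(s);
}
```

With nested paragraphs, the lookahead stops the outer element from matching, so only the innermost <p>inner</p> is returned. Whether that is acceptable depends on your input.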

Alan Moore