One possibility: use an XHTML parser that repairs malformed XHTML, such as libxml2, and then use the same library to locate and remove the empty p tags.
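libxml2 is a C library, so the following is only a rough .NET stand-in for the second step (locating and removing empty p elements). It is a minimal sketch that swaps in LINQ to XML and assumes the markup has already been repaired into well-formed XHTML; the class and method names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class EmptyParagraphs
{
    // Assumes wellFormedXhtml parses as XML; XDocument.Parse throws XmlException otherwise.
    public static string Remove(string wellFormedXhtml)
    {
        XDocument doc = XDocument.Parse(wellFormedXhtml, LoadOptions.PreserveWhitespace);

        // A p element counts as empty here when it has no child elements
        // and its text is empty or whitespace only.
        var empty = doc.Descendants()
            .Where(e => e.Name.LocalName == "p"
                        && !e.HasElements
                        && string.IsNullOrWhiteSpace(e.Value))
            .ToList(); // snapshot first, so removal does not disturb the enumeration

        foreach (XElement e in empty)
        {
            e.Remove();
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Calling EmptyParagraphs.Remove("<div><p>a</p><p/><p> </p></div>") should drop the second and third p elements and keep the first. Malformed input still has to go through a repairing parser before this step.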
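Inspired by this excellent post (the pattern relies on .NET balancing groups and is laid out for RegexOptions.IgnorePatternWhitespace):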
(?# line 01) <(?<open>.+?)>
(?# line 02) (?>
(?# line 03) <(?<open>.+?)> (?<DEPTH>)
(?# line 04) |
(?# line 05) </\k<open>> (?<-DEPTH>)
(?# line 06) |
(?# line 07) .?
(?# line 08) )*
(?# line 09) (?(DEPTH)(?!))
(?# line 10) </\k<open>>
This extracts only correctly matched tag pairs, not self-closing tags; it also does a basic nesting check, but not much else:
input:
<p>scet</p>
<p>sunny </p>
incorrect
<p>
<p>
<pre>mark</pre>
<p>Thomas </s>
<a>asd</a>
<p/>
<p><a>this should match</a></p>
<p><a>should not match</p></a>
output:
<p>scet</p>
<p>sunny </p>
<a>asd</a>
<p><a>this should match</a></p>
Each line of output is one match. However, tags containing attributes will of course not be included, since the closing tag is matched against everything captured between < and > in the opening tag. A regular expression that handled more cases correctly would be truly horrifying to look at, even with the nice formatting showcased in the blog I linked to :)
In these cases (especially since I gather you need valid XHTML output) I would always recommend running the input through a specialized parser, preferably one that reports parsing errors nicely, and handling those errors instead of hacking regular expressions. I don't know of any good (X)HTML parsers, though; I haven't needed to do anything like this in a very long time.
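To illustrate what "a parser that reports its errors" can look like in .NET, here is a minimal sketch that uses the built-in XML parser rather than a dedicated (X)HTML parser. It only checks well-formedness, and it assumes the input uses no entities beyond the predefined XML ones (something like &nbsp; would be reported as an error). XhtmlCheck, TryValidate and the <root> wrapper are names invented for this example.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class XhtmlCheck
{
    // Returns true when the fragment is well-formed XML.
    // Otherwise returns false and describes the first parse error in 'error'.
    public static bool TryValidate(string fragment, out string error)
    {
        try
        {
            // Wrap the fragment so that several top-level elements share one root.
            // The wrapper sits on line 1, so reported columns on that line are shifted by 6.
            XDocument.Parse("<root>" + fragment + "</root>");
            error = "";
            return true;
        }
        catch (XmlException ex)
        {
            error = "line " + ex.LineNumber + ", column " + ex.LinePosition + ": " + ex.Message;
            return false;
        }
    }
}
```

With the sample input above, XhtmlCheck.TryValidate("<p>scet</p>", out var e) should succeed, while "<p>Thomas </s>" should fail with a message about the mismatched end tag. The parser stops at the first error, so fixing a document is a parse, fix, repeat loop.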
This will work. It takes the HTML document in the string xhtml and overwrites every unclosed <p> tag with spaces:
    using System.Text;

    public static class XHTMLCleanerUpperThingy
    {
        private const string p = "<p>";
        private const string closingp = "</p>";

        public static string CleanUpXHTML(string xhtml)
        {
            StringBuilder builder = new StringBuilder(xhtml);
            // visit each <p> once instead of re-scanning from every character
            int current = xhtml.IndexOf(p);
            while (current != -1)
            {
                int idxofnext = xhtml.IndexOf(p, current + p.Length);
                int idxofclose = xhtml.IndexOf(closingp, current);
                // the tag is unclosed if no </p> follows it at all,
                // or if the next <p> comes before the next </p>
                bool unclosed = idxofclose < 0 || (idxofnext >= 0 && idxofnext < idxofclose);
                if (unclosed)
                {
                    // overwrite with spaces so the offsets into xhtml stay valid
                    for (int j = 0; j < p.Length; j++)
                    {
                        builder[current + j] = ' ';
                    }
                }
                current = idxofnext;
            }
            return builder.ToString();
        }
    }