I'm working on a specialized HTML stripper. The current stripper replaces <td> tags with tabs, then replaces <p> and <div> tags with double carriage returns. However, when stripping code like this:
<td>First Text</td><td style="background:#330000"><p style="color:#660000;text-align:center">Some Text</p></td>
It (obviously) produces:
First Text
Some Text
We'd like to have the <p> replaced with nothing in this case, so it produces:
First Text (tab) Some Text
However, we'd like to keep the double carriage-return replacement for other code where the <p> tag is not surrounded by <td> tags.
Basically, we're trying to replace <td> tags with \t always and <p> and <div> tags with \r\r ONLY when they're not surrounded by <td> tags.
Current code (C#):
// insert tabs in places of <TD> tags
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<td\b(?:[^>""']|""[^""]*""|'[^']*')*>", "\t",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// insert line paragraphs (double line breaks) in place
// of <P>, <DIV> and <TR> tags
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<(div|tr|p)\b(?:[^>""']|""[^""]*""|'[^']*')*>", "\r\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
(there's more code to the stripper; this is the relevant part)
Any ideas on how to do this without completely rewriting the entire stripper?
EDIT: I'd prefer to not use a library due to the headaches of getting it signed off on and included with the project (which itself is a library to be included in another project), not to mention the legal issues. If there is no other solution, though, I'll probably use the HTML Agility Pack.
Mostly, the stripper just strips out anything it finds that looks like a tag (done with a large regex based on a regex in Regular Expressions Cookbook). This, replacing line break tags with \r, and dealing with multiple tabs make up the bulk of the custom stripping code.
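For reference, one library-free direction that seems to fit these constraints is a pre-pass that runs before the two replacements above: match each <td>...</td> cell and delete any <p> or <div> tags inside it, so the later \r\r replacement only ever sees tags that sit outside a cell. The following is a minimal sketch, not a tested drop-in. The names TdAwareStripper and RemoveBlocksInsideCells are hypothetical, and it assumes cells are not nested (no tables inside tables) and that every <td> has an explicit </td>.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are illustrative only.
public static class TdAwareStripper
{
    // Same attribute-tolerant tag tail used by the patterns in the question.
    private const string TagTail = @"\b(?:[^>""']|""[^""]*""|'[^']*')*>";

    // One whole cell, from its opening <td ...> to the nearest </td>.
    // Assumption: cells are not nested and are always explicitly closed.
    private static readonly Regex Cell = new Regex(
        @"<td" + TagTail + @".*?</td\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Opening or closing <p>/<div> tags (the \b keeps <pre> etc. untouched).
    private static readonly Regex BlockTag = new Regex(
        @"</?(?:div|p)" + TagTail,
        RegexOptions.IgnoreCase);

    // Removes <p>/<div> tags that appear inside a <td>...</td> cell and
    // leaves everything outside a cell exactly as it was.
    public static string RemoveBlocksInsideCells(string html)
    {
        return Cell.Replace(html, m => BlockTag.Replace(m.Value, ""));
    }
}
```

It would be called once, as in result = TdAwareStripper.RemoveBlocksInsideCells(result);, ahead of the existing <td> and <div|tr|p> replacements, which then stay unchanged. If nested tables or unclosed cells are possible in the input, this regex-only approach will miss some tags, and a real parser such as the HTML Agility Pack is likely the safer route.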