I'm working on a specialized HTML stripper. The current stripper replaces <td> tags with tabs, then replaces <p> and <div> tags with double carriage returns. However, when stripping code like this:

<td>First Text</td><td style="background:#330000"><p style="color:#660000;text-align:center">Some Text</p></td>

It (obviously) produces

First Text

Some Text

We'd like to have the <p> replaced with nothing in this case, so it produces:

First Text (tab) Some Text

However, we'd like to keep the double carriage-return replacement for other code where the <p> tag is not surrounded by <td> tags.

Basically, we're trying to replace <td> tags with \t always and <p> and <div> tags with \r\r ONLY when they're not surrounded by <td> tags.

Current code: (C#)

  // insert tabs in places of <TD> tags
  result = System.Text.RegularExpressions.Regex.Replace(result,
           @"<td\b(?:[^>""']|""[^""]*""|'[^']*')*>", "\t",
           System.Text.RegularExpressions.RegexOptions.IgnoreCase);  

  // insert line paragraphs (double line breaks) in place
  // of <P>, <DIV> and <TR> tags
  result = System.Text.RegularExpressions.Regex.Replace(result,
           @"<(div|tr|p)\b(?:[^>""']|""[^""]*""|'[^']*')*>", "\r\r",
           System.Text.RegularExpressions.RegexOptions.IgnoreCase);

(there's more code to the stripper; this is the relevant part)

Any ideas on how to do this without completely rewriting the entire stripper?

EDIT: I'd prefer to not use a library due to the headaches of getting it signed off on and included with the project (which itself is a library to be included in another project), not to mention the legal issues. If there is no other solution, though, I'll probably use the HTML Agility Pack.

Mostly, the stripper just strips out anything it finds that looks like a tag (done with a large regex based on a regex in Regular Expressions Cookbook). That, replacing line break tags with \r, and dealing with multiple tabs make up the bulk of the custom stripping code.

+2  A: 

Have you thought about looking into the HTML Agility Pack? It has a lot of parsing options built in for manipulating tags.

Dillie-O
I'd prefer not to use a library; see above.
NickAldwin
A: 

I don't have an answer as far as writing it with Regular Expressions, but I'd highly recommend the HTML Agility Pack for something like this. You should be able to find the nodes easily with a simple selector and just replace them with whatever you want.

Chris Doggett
I'd prefer not to use a library; see above.
NickAldwin
A: 

So you can't use the Agility Pack. What if you created a simple match that checked for the existence of the <td> block? If it exists, you can do all the proper replacements for tags within the block; otherwise, use a second set of replacements that works for tags not within the block.

No need to rewrite the existing replacements, just creating one more simple one for your other condition. I guess this would depend on how much text is getting parsed in one "unit" of HTML stripping.
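One possible reading of this suggestion, as a minimal sketch: test each unit for a <td> tag and pick the replacement string accordingly. The class and method names (`UnitLevelSketch`, `ReplaceParagraphTags`) are hypothetical, and the sketch assumes each unit is small (a line or a single table row), since one <td> anywhere in the unit switches off the double carriage return for every <p> and <div> in it.

```csharp
using System.Text.RegularExpressions;

public static class UnitLevelSketch
{
    // Hypothetical helper: chooses ONE replacement for the whole unit.
    public static string ReplaceParagraphTags(string unit)
    {
        const RegexOptions opts = RegexOptions.IgnoreCase;

        // Same attribute-aware tag pattern as in the question, limited to <div> and <p>.
        const string blockTag = @"<(div|p)\b(?:[^>""']|""[^""]*""|'[^']*')*>";

        // Does the unit contain a table cell at all?
        bool hasCell = Regex.IsMatch(unit, @"<td\b", opts);

        // Inside a cell-bearing unit drop the tags; otherwise emit a paragraph break.
        return Regex.Replace(unit, blockTag, hasCell ? "" : "\r\r", opts);
    }
}
```

This is too coarse for a whole document that mixes tables and free-standing paragraphs; in that case the replacement has to be scoped to each <td>...</td> match rather than to the unit.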

Dillie-O
It varies between one line and an entire document.
NickAldwin
+2  A: 

Found the answer:

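  // remove p/div/tr inside of td's (run this BEFORE the <td> -> tab replacement)
  // Singleline lets .*? span line breaks inside a cell; IgnoreCase matches <TD> too
  result = System.Text.RegularExpressions.Regex.Replace(result,
           @"<td\b(?:[^>""']|""[^""]*""|'[^']*')*>.*?</td\b(?:[^>""']|""[^""]*""|'[^']*')*>",
           new MatchEvaluator(RemoveTagsWithinTD),
           RegexOptions.IgnoreCase | RegexOptions.Singleline);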

This code calls this separate method for each match:

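  // a separate method: strips opening <div>, <tr> and <p> tags from one matched <td>...</td> block
  private static string RemoveTagsWithinTD(Match matchResult) {
      return Regex.Replace(matchResult.Value,
             @"<(div|tr|p)\b(?:[^>""']|""[^""]*""|'[^']*')*>", "",
             RegexOptions.IgnoreCase);
  }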

This code was (again) based on another recipe from the Regular Expressions Cookbook (which was sitting in front of me the whole time, d'oh!). It's really a great book.
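To show how the ordering fits together, here is a self-contained sketch of the three passes run in sequence on the example from the question. The names `TagStripperSketch`, `Convert`, `Attrs` and `Opts` are made up for illustration, and step 4 is only a simplified stand-in for the real stripper's large tag-removal regex, which isn't shown in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripperSketch
{
    // Attribute-aware tail shared by every tag pattern (same shape as in the question).
    private const string Attrs = @"\b(?:[^>""']|""[^""]*""|'[^']*')*>";

    private const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    public static string Convert(string html)
    {
        // 1. Inside each <td>...</td>, drop opening <div>/<tr>/<p> tags.
        string result = Regex.Replace(html, "<td" + Attrs + ".*?</td" + Attrs,
            m => Regex.Replace(m.Value, "<(div|tr|p)" + Attrs, "", Opts), Opts);

        // 2. Every opening <td> becomes a tab.
        result = Regex.Replace(result, "<td" + Attrs, "\t", Opts);

        // 3. Any <div>/<tr>/<p> still present was outside a cell: paragraph break.
        result = Regex.Replace(result, "<(div|tr|p)" + Attrs, "\r\r", Opts);

        // 4. ASSUMPTION: simplified stand-in for the stripper's general tag removal.
        result = Regex.Replace(result, "</?[a-z][a-z0-9]*" + Attrs, "", Opts);

        return result;
    }

    public static void Main()
    {
        string cells = "<td>First Text</td><td style=\"background:#330000\">"
                     + "<p style=\"color:#660000;text-align:center\">Some Text</p></td>";
        Console.WriteLine(Convert(cells) == "\tFirst Text\tSome Text");   // True
        Console.WriteLine(Convert("<p>Alone</p>") == "\r\rAlone");        // True
    }
}
```

Note that the first cell also yields a leading tab, which is left to the existing tab-handling code. Because the lazy `.*?` stops at the first `</td>`, nested tables are not handled correctly by this approach.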

NickAldwin
I'm glad to hear you like Regular Expressions Cookbook. If any of your friends don't have a copy yet, O'Reilly and I are doing a giveaway at regexguru.com in which anyone can participate until the end of the month (28 Feb 2010).
Jan Goyvaerts