ansaurus

Question

How to strip characters between HTML tags

Answer 1

+1 A:

If it's just this specific case, here's a suitable regex to find all the spaces:

Regex regexForBreaks = new Regex(@"h1>[\s]*<h2", RegexOptions.Compiled);
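To actually strip the whitespace with this pattern, one option is to replace each match with the same text minus the whitespace. This is a minimal sketch that assumes the markup is already in a string; the class and method names (`HeadingGap`, `Collapse`) are hypothetical and not part of the original answer.

```csharp
using System.Text.RegularExpressions;

public static class HeadingGap
{
    // Same pattern as above: optional whitespace between the tail of an
    // h1 tag ("h1>") and the start of an h2 tag ("<h2").
    private static readonly Regex RegexForBreaks =
        new Regex(@"h1>[\s]*<h2", RegexOptions.Compiled);

    // Hypothetical helper: rewrites every match without the whitespace.
    public static string Collapse(string html)
    {
        return RegexForBreaks.Replace(html, "h1><h2");
    }
}
```

For example, `HeadingGap.Collapse("<h1>Title</h1>\r\n\t<h2>Subtitle</h2>")` returns `<h1>Title</h1><h2>Subtitle</h2>`.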

However, I think a regex is the wrong approach here if this is a more general case. For example, it's possible for tags to be nested within other tags, and then your problem needs a little more detail to figure out the right answer. As Jamie Zawinski said, "Some people, when confronted with a problem, think, 'I know, I'll use regular expressions.' Now they have two problems."

John Feminella 2009-09-01 14:55:00

Not sure I understand that last bit. Remove h1 and h2 and you've got the general case; what additional problem do you perceive?

AnthonyWJones 2009-09-01 15:01:07

Good point! I just want to eliminate the white spaces, new lines and tabs.

mattruma 2009-09-01 15:08:18

@AnthonyWJones: You can't do that. Imagine this case: "<pre><div>foo</div> bar <div>baz</div></pre>". The whitespace is intentional here and removing it will change the meaning.

John Feminella 2009-09-01 15:13:34

Answer 2

A:

One alternative to using a regex or string replace is the Html Agility Pack.

If you stick with a regex, here's a rough guess:

/// <summary>
///  Regular expression built for C# on: Tue, Sep 1, 2009, 03:56:27 PM
///  Using Expresso Version: 3.0.2766, http://www.ultrapico.com
///  
///  A description of the regular expression:
///  
///  <h1>
///      <h1>
///  [1]: A numbered capture group. [.+]
///      Any character, one or more repetitions
///  </h1>
///      </h1>
///  Match expression but don't capture it. [\s*]
///      Whitespace, any number of repetitions
///  <h2>
///      <h2>
///  [2]: A numbered capture group. [.+]
///      Any character, one or more repetitions
///  </h2>
///      </h2>
///  
///
/// </summary>
public static Regex regex = new Regex(
    "<h1>(.+)</h1>(?:\\s*)<h2>(.+)</h2>",
    RegexOptions.Singleline
    | RegexOptions.CultureInvariant
    | RegexOptions.Compiled
);

// This is the replacement string
public static string regexReplace =
    "<h1>$1</h1><h2>$2</h2>";

Chris S 2009-09-01 14:55:50
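A sketch of how the pattern and replacement string from this answer might be applied. The wrapper class and method names (`HeadingPairs`, `Strip`) are hypothetical. The sketch also swaps the greedy `.+` for the lazy `.+?`, which is an assumption rather than the answer's own pattern: with `RegexOptions.Singleline`, a greedy group can run across several headings when the input contains more than one `<h1>`/`<h2>` pair, so only the last gap would be stripped.

```csharp
using System.Text.RegularExpressions;

public static class HeadingPairs
{
    // Variant of the answer's pattern with lazy groups (.+?).
    // The lazy quantifier is an assumption, not the original pattern.
    private static readonly Regex Pattern = new Regex(
        "<h1>(.+?)</h1>\\s*<h2>(.+?)</h2>",
        RegexOptions.Singleline | RegexOptions.CultureInvariant);

    // Same replacement string as in the answer: both captured headings,
    // re-emitted with nothing between them.
    private const string Replacement = "<h1>$1</h1><h2>$2</h2>";

    // Hypothetical helper: strips whitespace between each h1/h2 pair.
    public static string Strip(string html)
    {
        return Pattern.Replace(html, Replacement);
    }
}
```

With two pairs in the input, such as `"<h1>A</h1>\n<h2>B</h2>\n<h1>C</h1>  <h2>D</h2>"`, each pair is matched separately and the whitespace inside both is removed; the newline between the pairs is left alone because it is not between an `</h1>` and an `<h2>`.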

Answer 3

A:

How about: Regex.Replace(str, @">\s+<","><")
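A minimal sketch of that one-liner wrapped in a helper, to show what it does and does not touch. The class and method names (`TagGaps`, `Collapse`) are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class TagGaps
{
    // Hypothetical wrapper around the one-liner above.
    public static string Collapse(string str)
    {
        // ">\s+<" matches a '>' and a '<' separated only by whitespace,
        // so gaps that contain any other text are left alone.
        return Regex.Replace(str, @">\s+<", "><");
    }
}
```

For example, `TagGaps.Collapse("<ul>\n  <li>a</li>\n  <li>b</li>\n</ul>")` returns `<ul><li>a</li><li>b</li></ul>`. The pattern does not know whether a `>` closes a tag or is literal text, so `TagGaps.Collapse("<element> > </element>")` returns `<element> ></element>`, which is the case raised in the comments on this answer.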

aquinas 2009-09-01 14:57:14

Misses situations where you have legitimate angle bracket characters in between elements: `<element> > </element>`

Welbog 2009-09-01 15:11:42

Addendum: By "misses", I mean it's overzealous. It will remove the space between `>` and `</element>` even though it should not.

Welbog 2009-09-01 15:12:48

Is "<element> > </element>" even valid HTML? Don't you have to use a reference (&gt;) for angle brackets inside the text of an element?

Darryl 2009-09-01 15:37:00

The closing bracket is valid; the opening bracket isn't.

Welbog 2009-09-01 15:49:46
