I have the following HTML:

<h1>Text Text</h1>      <h2>Text Text</h2>

I am still trying to get a handle on regular expressions, and trying to create one that would eliminate the spacing between the tags.

I would like the final result to be:

<h1>Text Text</h1><h2>Text Text</h2>

Any help would be greatly appreciated!

UPDATE

I would like to strip out all white spaces, tabs and new lines. So if I have:

<div>    <h1>Text Text</h1>      <h2>Text Text</h2>     </div>

I would like it to end up as:

<div><h1>Text Text</h1><h2>Text Text</h2></div>
+1  A: 

If it's just this specific case, here's a suitable regex to find all the spaces:

Regex regexForBreaks = new Regex(@"h1>[\s]*<h2", RegexOptions.Compiled);
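
To remove the match rather than just find it, the same pattern can be handed to `Replace`. This is only a minimal sketch for the exact markup in the question. The class and method names (`HeadingGapExample`, `RemoveGap`) are made up for illustration. Because the pattern also consumes `h1>` and `<h2`, the replacement string has to put them back.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadingGapExample
{
    // Same pattern as above: any whitespace between "h1>" and "<h2".
    private static readonly Regex regexForBreaks =
        new Regex(@"h1>[\s]*<h2", RegexOptions.Compiled);

    // Hypothetical helper: the match includes "h1>" and "<h2",
    // so the replacement restores them without the gap.
    public static string RemoveGap(string html)
    {
        return regexForBreaks.Replace(html, "h1><h2");
    }
}
```

Calling `HeadingGapExample.RemoveGap` on the first line of HTML from the question should give `<h1>Text Text</h1><h2>Text Text</h2>`. It does nothing for any other pair of tags.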

However, I think a regex is the wrong approach here if this is a more general case. For example, it's possible for tags to be nested within other tags, and then your problem needs a little more detail to figure out the right answer. As Jamie Zawinski said, "Some people, when confronted with a problem, think, 'I know, I'll use regular expressions.' Now they have two problems."

John Feminella
Not sure I understand that last bit. Remove h1 and h2 and you've got the general case. What additional problem do you perceive?
AnthonyWJones
Good point! I just want to eliminate the white spaces, new lines and tabs.
mattruma
@AnthonyWJones: You can't do that. Imagine this case: "<pre><div>foo</div> bar <div>baz</div></pre>". The whitespace is intentional here and removing it will change the meaning.
John Feminella
A: 

One alternative to using a regex or string replace is the Html Agility Pack.

If you do want a regex, here's a rough guess:

/// <summary>
///  Regular expression built for C# on: Tue, Sep 1, 2009, 03:56:27 PM
///  Using Expresso Version: 3.0.2766, http://www.ultrapico.com
///  
///  A description of the regular expression:
///  
///  <h1>
///      <h1>
///  [1]: A numbered capture group. [.+]
///      Any character, one or more repetitions
///  </h1>
///      </h1>
///  Match expression but don't capture it. [\s*]
///      Whitespace, any number of repetitions
///  <h2>
///      <h2>
///  [2]: A numbered capture group. [.+]
///      Any character, one or more repetitions
///  </h2>
///      </h2>
///  
///
/// </summary>
public static Regex regex = new Regex(
      "<h1>(.+)</h1>(?:\\s*)<h2>(.+)</h2>",
    RegexOptions.Singleline
    | RegexOptions.CultureInvariant
    | RegexOptions.Compiled
    );


// This is the replacement string
public static string regexReplace = 
      "<h1>$1</h1><h2>$2</h2>";
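
For what it's worth, the two members above would be used together roughly as follows. This is a sketch, not a tested solution: the pattern and replacement string are repeated so the snippet compiles on its own, and the wrapper names (`HeadingPairExample`, `Collapse`) are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadingPairExample
{
    // Same pattern and options as above.
    private static readonly Regex regex = new Regex(
        "<h1>(.+)</h1>(?:\\s*)<h2>(.+)</h2>",
        RegexOptions.Singleline
        | RegexOptions.CultureInvariant
        | RegexOptions.Compiled);

    // Same replacement string as above: both headings, nothing between them.
    private const string regexReplace = "<h1>$1</h1><h2>$2</h2>";

    // Hypothetical helper that applies the replacement to a string.
    public static string Collapse(string html)
    {
        return regex.Replace(html, regexReplace);
    }
}
```

Note that this only touches the gap between an h1 and the h2 that follows it. With the `<div>` example from the update, the whitespace just inside the `<div>` tags would be left in place.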
Chris S
A: 

How about: Regex.Replace(str, @">\s+<","><")
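
Wrapped up so it can be tried against the markup in the question, that looks something like the sketch below. The class and method names (`TagWhitespaceExample`, `StripBetweenTags`) are placeholders; the only real call is `Regex.Replace`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagWhitespaceExample
{
    // Removes any run of whitespace (spaces, tabs, newlines) that sits
    // directly between a '>' and the next '<'.
    public static string StripBetweenTags(string html)
    {
        return Regex.Replace(html, @">\s+<", "><");
    }
}
```

Given `<div>    <h1>Text Text</h1>      <h2>Text Text</h2>     </div>`, this should return `<div><h1>Text Text</h1><h2>Text Text</h2></div>`. It knows nothing about HTML, though, so it would also collapse whitespace that matters, for instance the space between the two `<b>` elements in `<pre><b>a</b> <b>b</b></pre>`.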

aquinas
Misses situations where you have legitimate angle bracket characters in between elements: `<element> > </element>`
Welbog
Addendum: By "misses", I mean it's overzealous. It will remove the space between `>` and `</element>` even though it should not.
Welbog
Is "<element> > </element>" even valid HTML? Don't you have to use a character reference (`&gt;`) for angle brackets inside the text of an element?
Darryl
The closing bracket is valid; the opening bracket isn't.
Welbog