Is there any way in VB.NET to remove all of the whitespace between tags in HTML?

Say, I've got this:

<tr>
    <td>

The string I've built is an entire HTML document, and it counts everything before those tags as legitimate space, so I need to trim it out. Is there a regex or function out there I could use to do this?

Thanks

A: 

Depending on the complexity of your document, you probably just need a replace regular expression across the document... Something like:

RegexObj.Replace(">[\s\n]*<","><")

You can read up about .NET and regular expressions here

Works amazingly well. Thank you.
Joe Morgan
Note that this will also remove legitimate spaces inside elements. For example, <td> </td> would have rendered one space; now it won't. Some browsers, in certain circumstances, will then display the cell completely differently because it's empty.
Joel Coehoorn
'>' is not always the end of a tag, it can be included unescaped in text, attribute values and many other places. As always, regex is the wrong tool for processing [X]HTML.
bobince
A: 

The above solution is a good start, but the code is slightly wrong and the regular expression is more than it needs to be. Here's the minimum that you would need to do in this case:

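' Requires: Imports System.Text.RegularExpressions
Dim RegexObj As New Regex(">[\s]*<")

' Replace any run of whitespace between a closing ">" and the next "<"
NewText = RegexObj.Replace(OldText, "><")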

The \n is unnecessary because .NET includes the carriage return and line feed characters in the set of whitespace characters (\s). I'm not sure about other languages. If it didn't, you would also need to include the \r character, because a Windows newline is \r\n, not just \n.
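
To check the point about \s for yourself, here is a minimal, self-contained sketch. The module name, the variable names, and the sample markup are made up for illustration; the sample uses Windows line endings (vbCrLf) so that both \r and \n appear between the tags. It uses the static Regex.Replace overload rather than a Regex instance, which is only a convenience here.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module WhitespaceDemo
    Sub Main()
        ' Hypothetical sample input; vbCrLf is the \r\n pair.
        Dim oldText As String = "<tr>" & vbCrLf & "    <td>cell</td>" & vbCrLf & "</tr>"

        ' \s covers space, tab, \r and \n in .NET, so no explicit \n or \r is needed.
        Dim newText As String = Regex.Replace(oldText, ">\s*<", "><")

        ' Prints: <tr><td>cell</td></tr>
        Console.WriteLine(newText)
    End Sub
End Module
```

The caveats from the comments on the first answer still apply: a cell such as <td> </td> is collapsed to <td></td>, and a ">" that appears inside text or an attribute value is treated as if it ended a tag.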