Hi

I have an HTML string that often contains long runs of whitespace.

Example:

<p>All     these words <br />
<strong>All</strong>   <em>these</em>   words
<pre>    All    these words</pre>
</p>

I need to remove them using JavaScript, and have come up with this regex:

String.replace(/ {2,}/g, '');

This seems to do the job of removing the unwanted whitespace, but I want to preserve the whitespace inside the PRE element.

Is this possible with a regex?

+1  A: 

Use:

String.replace(/(<pre[\s\S]*?>[\s\S]*?<\/pre>)| {2,}/ig, '$1')

Tested in Firefox 3.

Edit:

See the test page here: http://ashita.org/StackOverflow/replacetest.html

Jonathan Fingland
Thanks! Seems to work, except that the expected spaces between words are collapsed. Using '$1 ' seems to fix that.
And we could use <pre.*?> in case the PRE element has any attributes.
It works by first trying to match the pre element, then matching the spaces. The pre is inside a capture group (the parentheses). When a pre section is matched, the $1 ensures it is replaced by itself. When two or more consecutive spaces are matched, the $1 inserts nothing, because the capture group is empty in that case.
Jonathan Fingland
Good point, pokemon. Edited the code.
Jonathan Fingland
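
For anyone reading along, here is a minimal sketch of the capture-group trick explained in the comments, using a replacement callback instead of '$1' so that runs of spaces outside the pre collapse to a single space rather than to nothing. The function name collapseSpaces is made up for illustration; the regex is the one from the answer. It still assumes pre blocks are not nested and that nothing else on the page needs its spacing preserved.

```javascript
// Hypothetical helper; the regex is the one from the answer above.
function collapseSpaces(html) {
  // Alternative 1 captures a whole <pre>...</pre> block.
  // Alternative 2 matches a run of two or more spaces.
  var pattern = /(<pre[\s\S]*?>[\s\S]*?<\/pre>)| {2,}/gi;
  return html.replace(pattern, function (match, pre) {
    // `pre` is undefined when the space alternative matched,
    // so keep pre blocks as they are and shrink space runs to one space.
    return pre !== undefined ? pre : ' ';
  });
}

var input =
  '<p>All     these words <br />\n' +
  '<strong>All</strong>   <em>these</em>   words\n' +
  '<pre>    All    these words</pre>\n' +
  '</p>';

console.log(collapseSpaces(input));
```

The callback also avoids the extra space that '$1 ' would append after every pre block.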
+5  A: 

You can't do this with regular expressions and it's as simple as that.

Regular expressions are a poor choice for this kind of thing. <pre> blocks can contain other tags and so forth. Also what about CSS (either inline or with classes) that uses the white-space: pre property?

HTML and browsers handle white-space just fine. Is this really a problem you need to solve? If you do, you need an HTML parser of some kind.
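
If you do go the parser route, the following is a rough sketch of the idea, not a complete solution: walk an already-parsed DOM tree, collapse space runs in text nodes, and skip anything under a PRE element. The function name collapseOutsidePre is made up for illustration, and the sketch assumes the tree exposes the standard nodeType, nodeName, nodeValue and childNodes properties. It does not look at CSS at all, so elements styled with white-space: pre would still be collapsed.

```javascript
// Node type constants as defined by the DOM standard.
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// Hypothetical helper: collapses runs of spaces in every text node
// under `node`, except text that sits inside a PRE element.
function collapseOutsidePre(node) {
  var children = Array.from(node.childNodes);
  for (var i = 0; i < children.length; i++) {
    var child = children[i];
    if (child.nodeType === TEXT_NODE) {
      child.nodeValue = child.nodeValue.replace(/ {2,}/g, ' ');
    } else if (child.nodeType === ELEMENT_NODE &&
               child.nodeName.toUpperCase() !== 'PRE') {
      // Recurse into ordinary elements; PRE subtrees are left untouched.
      collapseOutsidePre(child);
    }
  }
  return node;
}
```

In a browser you could get such a tree with new DOMParser().parseFromString(html, 'text/html'), call collapseOutsidePre on doc.body, and read the result back from doc.body.innerHTML. Keep in mind that the parser may restructure invalid markup, such as a pre nested inside a p.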

cletus
While you make a good point, cletus, the question was not about elements with white-space: pre. It also didn't ask for elements inside the pre to be treated differently from the pre itself.
Jonathan Fingland
Yes it did. Inside the pre, the whitespace should not be removed, in contrast to outside.
Svante
@Svante: who are you disagreeing with?
Jonathan Fingland