Hi,

I know that there are many types of space (em space, en space, thin space, non-breaking space, etc.), but all the ones I mentioned have HTML entities (at least, PHP's htmlentities() returns something like &nbsp;).

But, what about those spaces that have no HTML entities?
Example: http://iorbix.com/social/display-profile.php?id=711236966275&name=Nuno-Peralta
Look at the nickname of this account. It has many " " (spaces) at the front, which are visible to us (this doesn't happen with &nbsp;).

I have already tried filtering with regular expressions using \x escapes, and with str_replace() using the space itself as the argument, with no luck at all!

Do you have any suggestions on how to filter ALL types of whitespace?
Thanks,
Nuno Peralta

+1  A: 
$result = preg_replace('/\s/', '', $yourString);

See http://www.php.net/manual/en/regexp.reference.backslash.php for more information on \s.

DrColossos
+1  A: 

They are all plain spaces (returning character code 32) that can be caught with regular expressions or trim().

Try this:

preg_replace("/\s{2,}/", " ", $text);
animuson
I use (mb_)trim() by default for all user input, and I have already tried \s; neither works. Thanks for your help.
Nuno Peralta
+2  A: 

By default, \s will not match whitespace characters outside the ASCII range (code points above 127). To get at those, you can instead make good use of other UTF-8-aware sequences.


(Standard disclaimer: I'm skimming the PCRE source code to compile the lists below, so I may miss a character or type something incorrectly. Please forgive me.)

\p{Zs} matches:

  • U+0020 Space
  • U+00A0 No-break space
  • U+1680 Ogham space mark
  • U+180E Mongolian vowel separator
  • U+2000 En quad
  • U+2001 Em quad
  • U+2002 En space
  • U+2003 Em space
  • U+2004 Three-per-em space
  • U+2005 Four-per-em space
  • U+2006 Six-per-em space
  • U+2007 Figure space
  • U+2008 Punctuation space
  • U+2009 Thin space
  • U+200A Hair space
  • U+202F Narrow no-break space
  • U+205F Medium mathematical space
  • U+3000 Ideographic space

\h (Horizontal whitespace) matches the same as \p{Zs} above, plus

  • U+0009 Horizontal tab.
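
As a small illustration of the difference between the two, here is a minimal sketch (it assumes PHP 7 or later for the "\u{...}" string escapes, and a PCRE build with Unicode property support, which is the usual case). The variable names are only for illustration.

```php
<?php
// U+0009 (tab) is not in the Zs category, but \h matches it.
$tab  = "\t";
$nbsp = "\u{00A0}"; // no-break space, written with a PHP 7+ Unicode escape

var_dump(preg_match('/\p{Zs}/u', $tab));  // int(0)
var_dump(preg_match('/\h/u', $tab));      // int(1)
var_dump(preg_match('/\p{Zs}/u', $nbsp)); // int(1)
var_dump(preg_match('/\h/u', $nbsp));     // int(1)
```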

Similarly for matching vertical whitespace there are a few options.

\p{Zl} matches U+2028 Line separator.

\p{Zp} matches U+2029 Paragraph separator.

\v (Vertical whitespace) matches \p{Zl}, \p{Zp} and the following

  • U+000A Linefeed
  • U+000B Vertical tab
  • U+000C Formfeed
  • U+000D Carriage return
  • U+0085 Next line
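
For example, \v can be used to split text on any of the vertical whitespace characters listed above. This is only a sketch, assuming PHP 7 or later for the "\u{...}" escapes; the sample string is made up.

```php
<?php
// A made-up string using several different line-break characters:
// CR+LF, U+2028 line separator, U+0085 next line, U+000B vertical tab.
$text = "one\r\ntwo\u{2028}three\u{0085}four\u{000B}five";

// \v+ treats a run of vertical whitespace (such as "\r\n") as one delimiter.
$lines = preg_split('/\v+/u', $text);

print_r($lines); // one, two, three, four, five
```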

Going back to the beginning, in UTF-8 mode (i.e. using the u pattern modifier) \s will match any character that \p{Z} matches (which is anything that \p{Zs}, \p{Zl} and \p{Zp} will match), plus

  • U+0009 Horizontal tab
  • U+000A Linefeed
  • U+000C Formfeed
  • U+000D Carriage return

To cut a long story short (I bet you read all of the above, didn't you?) you might want to use \s but make sure to be in UTF-8 mode like /\s/u. Putting that to some practical use, to filter out those matching whitespace characters from a string you would do something like

$new_string = preg_replace('/\s/u', '', $old_string);

Finally, if you really, really care about the vertical whitespace characters which aren't included in \s (VT and NEL), then you can use the character class [\s\v] to match all 26 of the whitespace characters listed above.
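
Putting it all together, a rough sketch of a filter might look like the following. The function name strip_all_whitespace is hypothetical (it is not a PHP built-in), and the sample nickname is invented. The class also includes \h, which is redundant if \s already covers those characters in your PCRE build, but harmless otherwise. PHP 7 or later is assumed.

```php
<?php
// Hypothetical helper, not part of PHP itself.
function strip_all_whitespace(string $text): string
{
    // \s, \h and \v together cover the 26 characters listed above;
    // the u modifier makes PCRE treat the pattern and subject as UTF-8.
    $result = preg_replace('/[\s\h\v]+/u', '', $text);

    // preg_replace() returns null on failure, for example when the
    // subject is not valid UTF-8 and the u modifier is in use.
    if ($result === null) {
        throw new InvalidArgumentException('preg_replace failed (is the input valid UTF-8?)');
    }

    return $result;
}

// Invented nickname padded with em space, no-break space, plain space,
// a tab, a next-line character and an ideographic space.
$nickname = "\u{2003}\u{00A0} Nu\tno\u{0085}\u{3000}";
var_dump(strip_all_whitespace($nickname) === 'Nuno'); // bool(true)
```

Note that this removes every whitespace character, including ordinary spaces between words; to only trim the ends, the same class could be anchored with ^ and $ instead.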

salathe
Wow! I read it all, yes, but it seems that it only worked once I added "\h", so the final RE is: /[\s\h\v]/u - Thanks a lot!!
Nuno Peralta