By default, \s will not match whitespace characters with code points above 127 (that is, anything outside ASCII). To get at those, you can instead use other UTF-8-aware sequences.
(Standard disclaimer: I'm skimming the PCRE source code to compile the lists below, so I may miss a character or type something incorrectly. Please forgive me.)
\p{Zs} matches:
- U+0020 Space
- U+00A0 No-break space
- U+1680 Ogham space mark
- U+180E Mongolian vowel separator
- U+2000 En quad
- U+2001 Em quad
- U+2002 En space
- U+2003 Em space
- U+2004 Three-per-em space
- U+2005 Four-per-em space
- U+2006 Six-per-em space
- U+2007 Figure space
- U+2008 Punctuation space
- U+2009 Thin space
- U+200A Hair space
- U+202F Narrow no-break space
- U+205F Medium mathematical space
- U+3000 Ideographic space
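As a rough illustration of the list above, the following sketch counts and normalizes space separators in a made-up sample string. It assumes PHP 7 or later for the "\u{...}" string escape; the variable names are only for illustration.

```php
<?php
// Made-up sample: no-break space, ideographic space, then a plain ASCII space.
$sample = "a\u{00A0}b\u{3000}c d";

// Count the space separators. The u modifier makes PCRE read the subject as UTF-8.
$count = preg_match_all('/\p{Zs}/u', $sample);

// Replace each space separator with a plain ASCII space.
$normalized = preg_replace('/\p{Zs}/u', ' ', $sample);
// $count is 3 and $normalized is "a b c d"
```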
\h (Horizontal whitespace) matches the same as \p{Zs} above, plus
- U+0009 Horizontal tab
Similarly, for matching vertical whitespace there are a few options.
\p{Zl} matches U+2028 Line separator.
\p{Zp} matches U+2029 Paragraph separator.
\v (Vertical whitespace) matches \p{Zl}, \p{Zp} and the following:
- U+000A Linefeed
- U+000B Vertical tab
- U+000C Formfeed
- U+000D Carriage return
- U+0085 Next line
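To make the horizontal/vertical split concrete, here is a small sketch that collapses runs of \h characters while leaving line breaks alone, and then splits a string on \v characters. The sample strings are invented, and PHP 7 or later is assumed for the "\u{...}" escape.

```php
<?php
// Collapse each run of horizontal whitespace (tab, no-break space, em space, ...)
// into a single ASCII space. The linefeed is vertical whitespace, so it survives.
$collapsed = preg_replace('/\h+/u', ' ', "one\t\u{00A0}two\nthree\u{2003}four");
// $collapsed is "one two\nthree four"

// Split on any single vertical whitespace character:
// line separator (U+2028), next line (U+0085) and linefeed all act as delimiters.
$lines = preg_split('/\v/u', "first\u{2028}second\u{0085}third\nfourth");
// $lines is ["first", "second", "third", "fourth"]
```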
Going back to the beginning, in UTF-8 mode (i.e. using the u pattern modifier) \s will match any character that \p{Z} matches (which is anything that \p{Zs}, \p{Zl} and \p{Zp} will match), plus:
- U+0009 Horizontal tab
- U+000A Linefeed
- U+000C Formfeed
- U+000D Carriage return
To cut a long story short (I bet you read all of the above, didn't you?) you might want to use \s but make sure to be in UTF-8 mode, like /\s/u. Putting that to some practical use, to filter out those matching whitespace characters from a string you would do something like:
$new_string = preg_replace('/\s/u', '', $old_string);
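One thing to watch with the u modifier: preg_replace() returns null when the subject is not valid UTF-8. A minimal sketch of a wrapper that checks for this is below; strip_whitespace() is a hypothetical helper name, not something from PHP itself, and throwing an exception is just one possible way to handle the failure.

```php
<?php
// Hypothetical wrapper around the one-liner above.
function strip_whitespace(string $input): string
{
    $result = preg_replace('/\s/u', '', $input);
    if ($result === null) {
        // In UTF-8 mode, a subject that is not valid UTF-8 makes preg_replace() return null.
        throw new InvalidArgumentException('preg_replace failed, error code ' . preg_last_error());
    }
    return $result;
}

$clean = strip_whitespace("foo\u{00A0}bar\u{2003}baz\tqux");
// $clean is "foobarbazqux"
```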
Finally, if you really, really care about the vertical whitespace characters which aren't included in \s (VT and NEL) then you can use the character class [\s\v], still in UTF-8 mode, to match all 26 of the whitespace characters listed above.
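As a quick sketch of the combined class, the example below strips a mix of characters taken from the lists above. The sample string is invented and PHP 7 or later is assumed for the "\u{...}" escape. The class is harmless if a given PCRE build already treats some of these characters as \s.

```php
<?php
// Vertical tab, next line, paragraph separator, no-break space and a plain space.
$old_string = "a\u{000B}b\u{0085}c\u{2029}d\u{00A0}e f";

// [\s\v] in UTF-8 mode covers everything \s matches plus all vertical whitespace.
$new_string = preg_replace('/[\s\v]/u', '', $old_string);
// $new_string is "abcdef"
```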