I'm writing a WordPress plugin, and one of the features is removing duplicate whitespace.

My code looks like this:

return preg_replace('/\s\s+/u', ' ', $text, -1, $count);
  • I don't understand why I need the u modifier. I've seen other plugins that use preg_replace and don't need to modify it for Unicode. I believe I have a default installation of WordPress.

  • Without the modifier, the code replaces all the spaces with Unicode replacement glyphs instead of spaces.

  • With the u modifier, I don't get the glyphs, and it doesn't replace all the whitespace.

Each gap below has between 1 and 10 spaces. The regex only removes one space from each group.

Before:

This sentence  has extra space.  This doesn’t.  Extra  space, Lots          of extra space.

After:

This sentence has extra space. This doesn’t. Extra space, Lots         of extra space.

$count = 9

How can I make the regex replace the whole match with the one space?


Update: If I try this in plain PHP, outside WordPress, it works fine:

$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);

It only breaks when I use it within the WordPress plugin. I'm using this function in a filter:

function jje_test( $text ) {
    $new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
    echo "Count: $count";
    return $new_text;
}

add_filter('the_content', 'jje_test');

I have tried:

  • Removing all other filters on the_content
    remove_all_filters('the_content');
  • Changing the priority of the filter added to the_content, earlier or later
  • All kinds of permutations of \s+, \s\s+, [ ]+ etc.
  • Even replacing every single space with an empty string; the spaces are still not replaced
A: 

The u modifier simply puts it into UTF-8 mode, which is useful if you need to do anything specific with characters that have a code point above 0x7f. You can still work on UTF-8 encoded strings without using that modifier, you just won't be able to specifically match or transform such characters easily.

There are some whitespace characters in Unicode that are above 0x7f. It's pretty rare to encounter them in most data. But you may see, for example, a non-breaking space character, which is Unicode code point U+00A0, or some rarer characters.
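
One way to check whether a string actually contains such a character is to look at its raw bytes. The following is only a minimal sketch, assuming the text is UTF-8 encoded; the sample string and the variable names are made up for illustration:

```php
<?php
// Hypothetical sample: "a", a non-breaking space (U+00A0), then "b".
// In UTF-8, U+00A0 is encoded as the two bytes 0xC2 0xA0.
$text = "a\xC2\xA0b";

// bin2hex() shows the raw bytes: a non-breaking space appears as "c2a0",
// whereas a regular space would appear as "20".
$hex = bin2hex($text);
echo $hex, "\n"; // 61c2a062

// With the u modifier, \x{A0} matches the code point U+00A0 directly.
$hasNbsp = preg_match('/\x{A0}/u', $text) === 1;
var_dump($hasNbsp); // bool(true)
```

Running something like this on the content passed to the filter should show whether the "spaces" are really byte 0x20 or something else.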

I don't know why using it would cause Unicode "replacement" glyphs to be output. I'd say it would be a problem elsewhere... what character encoding are you outputting your script as?

thomasrutter
The content-type header is set to `charset=UTF-8`, the mysql database collation is set to utf8_general_ci, and the reading settings in wordpress itself are set to UTF-8. So I really don't understand how a regular space character is being interpreted this way. It's not like I have some weird data source. I've typed in the data myself.
jjeaton
Maybe you could put up a working demo somewhere online - someone may be able to see what it's doing and help you.
thomasrutter
My comment above on the question has a link to some sample code with results.
jjeaton
A: 

Don't know about any modifiers, but this did the trick:

<?php
$text = ' Hi,   my name is    Andrés.  ';
echo preg_replace(array('/^\s+/', '/\s+$/', '/\s{2,}/'), ' ', $text);
/*
Hi, my name is Andrés.
*/
?>
misterte
Still doesn't work for me, unfortunately. I tried just using the `/\s{2,}/` also, and it didn't match anything for me. Maybe there is something wrong with my wordpress/php setup?
jjeaton
Where are you getting your text from?
misterte
Let me be more specific: you should let PHP know what you are sending to and retrieving from the DB. After any connection and before any query you should call mysql_set_charset('utf8', $connection_resource);
misterte
A: 
preg_replace('!\s+!', ' ', 'This sentence  has extra space.  This doesn’t.  Extra  space, Lots          of extra space.');
fabrik
This doesn't work either.
jjeaton
+1  A: 

This will replace all sequences of two or more spaces, tabs, and/or line breaks with a single space:

return preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);

You need the /u flag if $text holds text encoded as UTF-8. Even if there are no Unicode characters in your regex, PCRE has to interpret $text correctly.

I added \p{Z} to the character class because PCRE only matches ASCII characters when using shorthands such as \s, even when using /u. Adding \p{Z} makes sure all Unicode whitespace is matched. There might be other spaces such as non-breaking spaces in your string.
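
As a quick check of that pattern, here is a sketch run on a made-up sample string that mixes regular spaces, a tab, and a non-breaking space (written as the bytes \xC2\xA0 so that it is visible in the source). The sample text is an assumption for illustration, not the asker's data:

```php
<?php
// Hypothetical input: two regular spaces, then a non-breaking space
// followed by a regular space, then a tab followed by a regular space.
$text = "Extra  space,\xC2\xA0 lots\t of it.";

// [\p{Z}\s] covers Unicode separators plus ASCII whitespace;
// {2,} requires a run of at least two such characters.
$clean = preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text, -1, $count);

echo $clean, "\n"; // Extra space, lots of it.
echo $count, "\n"; // 3
```

Single spaces are left alone because the quantifier needs at least two whitespace characters in a row.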

I'm not sure if using echo in a WordPress filter is a good idea.

Jan Goyvaerts
This worked! Thank you! I'm wondering if it was non-breaking spaces, although I didn't see them in the HTML source. I agree on the use of `echo`; it was only in there for debugging purposes, to count the number of matches. The thing I don't understand is why all the built-in WordPress functions that operate on the same database (which always defaults to UTF-8) don't have to use the `/u` flag. See the `wp_texturize()` function for an example: http://wordpress.taragana.net/wp-includes/formatting.php.source.html#l3
jjeaton
A regex that only works on ASCII characters (bytes 0 to 127) will work correctly on a UTF-8 string even without `/u` because UTF-8 is specifically designed to be transparent to processes that understand only ASCII and ignore bytes > 127.
Jan Goyvaerts
Depending on how you look at the HTML source, you may not "see" non-breaking spaces because they look just like regular spaces.
Jan Goyvaerts
I understand; however, some of their functions also match whitespace with `\s*` and seem to work without trying to match Unicode whitespace.
jjeaton
In PCRE, `\s` only matches ASCII whitespace, so it's not affected by the `/u` flag.
Jan Goyvaerts
Is there a way to modify this to replace whichever whitespace was matched with itself? e.g. This will match duplicate newlines and replace it with a single space. It would be better if it replaced duplicate spaces with a space, duplicate newlines with one newline, etc.
jjeaton
`return preg_replace('/([\p{Z}\s])\1+/u', '$1', $text);` But now, of course, it won't match runs of *heterogeneous* whitespace.
Alan Moore
A: 

To answer jjeaton's follow-up question in the comments to my first reply, the following replaces each sequence of spaces, tabs, and/or line breaks with the first character in that sequence. Effectively, this deletes the second and following whitespace characters in each sequence of two or more whitespace characters. A run of spaces is replaced with a single space, a run of tabs is replaced with a single tab, etc. A run of a space and a tab (in that order) is replaced with a space, and a run of a tab and a space is replaced with a tab, etc.

return preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);

This regex works by first matching one whitespace character and capturing it with a capturing group, followed by one or more further whitespace characters. The replacement text simply reinserts the text matched by the first (and only) capturing group.
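
For illustration, here is a small sketch of that behaviour on a made-up string, next to the backreference variant from the comments on my first reply. The input and the variable names are hypothetical:

```php
<?php
// Hypothetical input: a run of spaces, a run of newlines,
// and a tab followed by two spaces.
$text = "one   two\n\n\nthree\t  four";

// Group 1 captures the first whitespace character of each run;
// the rest of the run is matched and dropped by the replacement.
$firstWins = preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);
var_dump($firstWins === "one two\nthree\tfour"); // bool(true)

// The backreference variant only collapses runs of the *same* character,
// so the tab survives and the two spaces after it collapse to one.
$sameOnly = preg_replace('/([\p{Z}\s])\1+/u', '$1', $text);
var_dump($sameOnly === "one two\nthree\t four"); // bool(true)
```

Which of the two is preferable depends on whether mixed runs such as tab-plus-spaces should collapse to a single character.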

Jan Goyvaerts