ansaurus

Question

Matching duplicate whitespace with preg_replace

Answer 1

A:

The u modifier simply puts the regex engine into UTF-8 mode, which is useful if you need to do anything specific with characters that have a code point above 0x7f. You can still work on UTF-8 encoded strings without that modifier; you just won't be able to specifically match or transform such characters easily.
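As a rough sketch of the difference (the sample string is made up for illustration), compare how the dot behaves with and without the u modifier on a single character above 0x7f:

```php
<?php
// "é" is U+00E9, stored as the two bytes 0xC3 0xA9 in UTF-8.
$text = "\xC3\xA9";

// Without u, the dot matches one byte at a time.
echo preg_match_all('/./', $text), "\n";      // 2

// With u, the dot matches one whole character.
echo preg_match_all('/./u', $text), "\n";     // 1

// A code point above 0x7f can be named directly in UTF-8 mode.
echo preg_match('/\x{00E9}/u', $text), "\n";  // 1
```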

There are some whitespace characters in Unicode that are above 0x7f. It's pretty rare to encounter them in most data. But you may see, for example, a non-breaking space character, which is Unicode U+00A0, or some rarer characters.

I don't know why using it would cause Unicode "replacement" glyphs to be output. I'd say it would be a problem elsewhere... what character encoding are you outputting your script as?

thomasrutter 2010-06-29 02:00:50

The content-type header is set to `charset=UTF-8`, the mysql database collation is set to utf8_general_ci, and the reading settings in wordpress itself are set to UTF-8. So I really don't understand how a regular space character is being interpreted this way. It's not like I have some weird data source. I've typed in the data myself.

jjeaton 2010-06-30 03:01:30

Maybe you could put up a working demo somewhere online - someone may be able to see what it's doing and help you.

thomasrutter 2010-07-01 00:21:16

My comment above on the question has a link to some sample code with results.

jjeaton 2010-07-09 04:18:59

Answer 2

A:

Don't know about any modifiers, but this did the trick:

<?php
$text = ' Hi,   my name is    Andrés.  ';
// Strip leading and trailing whitespace, then collapse inner runs to one space.
echo preg_replace(array('/^\s+/', '/\s+$/', '/\s{2,}/'), array('', '', ' '), $text);
/*
Hi, my name is Andrés.
*/
?>

misterte 2010-06-29 02:01:18

Still doesn't work for me, unfortunately. I tried just using the `/\s{2,}/` also, and it didn't match anything for me. Maybe there is something wrong with my wordpress/php setup?

jjeaton 2010-06-30 02:50:12

Where are you getting your text from?

misterte 2010-06-30 16:48:45

Let me be more specific: you should let PHP know what encoding you are sending to and retrieving from the DB. After any connection and before any query, you should call mysql_set_charset('utf8', $connection_resource);

misterte 2010-06-30 21:34:54

Answer 3

A:

preg_replace('!\s+!', ' ', 'This sentence  has extra space.  This doesn’t.  Extra  space, Lots          of extra space.');

fabrik 2010-06-29 11:20:43

This doesn't work either.

jjeaton 2010-06-30 03:02:57

Answer 4

+1 A:

This will replace all sequences of two or more spaces, tabs, and/or line breaks with a single space:

return preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);

You need the /u flag if $text holds text encoded as UTF-8. Even if there are no Unicode characters in your regex, PCRE has to interpret $text correctly.

I added \p{Z} to the character class because PCRE only matches ASCII characters when using shorthands such as \s, even when using /u. Adding \p{Z} makes sure all Unicode whitespace is matched. There might be other spaces such as non-breaking spaces in your string.
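A minimal sketch of that pattern in use, assuming the input contains a UTF-8 non-breaking space next to a regular one. The function name `collapse_whitespace` and the sample text are hypothetical; only the regex comes from the answer.

```php
<?php
// Hypothetical helper name; the pattern is the one given above.
function collapse_whitespace($text)
{
    // \p{Z} covers Unicode separators (including U+00A0), \s covers ASCII
    // whitespace such as tabs and line breaks; /u treats $text as UTF-8.
    return preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);
}

$nbsp = "\xC2\xA0";                       // U+00A0 encoded as UTF-8
$text = "Extra" . $nbsp . " space,\t\nlots  of it.";

echo collapse_whitespace($text), "\n";    // Extra space, lots of it.
```

Because of the `{2,}` quantifier, a lone non-breaking space between two words is left untouched; only runs of two or more whitespace characters are replaced.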

I'm not sure if using echo in a WordPress filter is a good idea.

Jan Goyvaerts 2010-07-12 08:01:25

This worked! Thank you! I'm wondering if it was non-breaking spaces, although I didn't see them in the HTML source. I agree on the use of `echo`; it was only in there for debugging purposes, to count the number of matches. The thing I don't understand is why all the built-in WordPress functions that operate on the same database (which always defaults to UTF-8) don't have to use the `/u` flag. See the `wp_texturize()` function for an example: http://wordpress.taragana.net/wp-includes/formatting.php.source.html#l3

jjeaton 2010-07-14 02:06:59

A regex that only works on ASCII characters (bytes 0 to 127) will work correctly on a UTF-8 string even without `/u` because UTF-8 is specifically designed to be transparent to processes that understand only ASCII and ignore bytes > 127.
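For instance, here is a small sketch with a made-up string: the pattern below names only ASCII bytes, so it can run without `/u` and still leave the multi-byte character intact.

```php
<?php
// Made-up sample containing a multi-byte character (é is 0xC3 0xA9 in UTF-8).
$text = "Andr\xC3\xA9s   wrote\t\tthis";

// No /u: the subject is scanned byte by byte, but the pattern only looks
// for the ASCII bytes 0x20 (space) and 0x09 (tab).
$result = preg_replace('/[ \t]{2,}/', ' ', $text);

echo $result, "\n";                          // Andrés wrote this

// An empty pattern with /u only matches if the subject is valid UTF-8.
var_dump(preg_match('//u', $result) === 1);  // bool(true)
```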

Jan Goyvaerts 2010-07-14 03:01:42

Depending on how you look at the HTML source, you may not "see" non-breaking spaces because they look just like regular spaces.
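One way to check is to make them visible before printing. This is only a sketch: the sample string and the `[NBSP]` marker are arbitrary choices for illustration.

```php
<?php
// Hypothetical sample: looks like "two  spaces" on screen, but the second
// "space" is really U+00A0 (bytes 0xC2 0xA0 in UTF-8).
$text = "two \xC2\xA0spaces";

// Replace every non-breaking space with a visible marker.
$visible = preg_replace('/\x{00A0}/u', '[NBSP]', $text);
echo $visible, "\n";                               // two [NBSP]spaces

// Count how many there are.
echo preg_match_all('/\x{00A0}/u', $text), "\n";   // 1
```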

Jan Goyvaerts 2010-07-14 03:03:10

I understand; however, some of their functions also match whitespace with `\s*` and seem to work without trying to match Unicode whitespace.

jjeaton 2010-07-14 12:58:10

In PCRE, `\s` only matches ASCII whitespace, so it's not affected by the `/u` flag.

Jan Goyvaerts 2010-07-17 14:31:48

Is there a way to modify this to replace whichever whitespace was matched with itself? e.g. This will match duplicate newlines and replace it with a single space. It would be better if it replaced duplicate spaces with a space, duplicate newlines with one newline, etc.

jjeaton 2010-07-18 23:20:05

`return preg_replace('/([\p{Z}\s])\1+/u', '$1', $text);` But now, of course, it won't match runs of *heterogeneous* whitespace.

Alan Moore 2010-07-19 02:37:36

Answer 5

A:

To answer jjeaton's follow-up question in the comments to my first reply, the following replaces each sequence of spaces, tabs, and/or line breaks with the first character in that sequence. Effectively, this deletes the second and following whitespace characters in each sequence of two or more whitespace characters. A run of spaces is replaced with a single space, a run of tabs is replaced with a single tab, etc. A run of a space and a tab (in that order) is replaced with a space, and a run of a tab and a space is replaced with a tab, etc.

return preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);

This regex works by first matching one whitespace character and capturing it with a capturing group, followed by one or more further whitespace characters. The replacement text simply reinserts the text matched by the first (and only) capturing group.
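To make the difference from the backreference version in Alan Moore's comment concrete, here is a small comparison. The wrapper function names and the sample string are hypothetical; both patterns are quoted from this thread.

```php
<?php
// Hypothetical wrapper names; the patterns come from the thread.
function keep_first_of_run($text)
{
    // Capture one whitespace character, then swallow the rest of the run.
    return preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);
}

function collapse_identical_runs($text)
{
    // \1+ only matches repeats of the very character that was captured.
    return preg_replace('/([\p{Z}\s])\1+/u', '$1', $text);
}

// Two spaces, three newlines, then a space followed by a tab.
$text = "a  b\n\n\nc \td";

var_dump(keep_first_of_run($text) === "a b\nc d");          // bool(true)
var_dump(collapse_identical_runs($text) === "a b\nc \td");  // bool(true)
```

The mixed run (space then tab) is reduced to its first character by the first function, while the second leaves it alone because the two characters differ.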

Jan Goyvaerts 2010-07-19 02:24:35
