ansaurus

Question

regex double white space separation problem

Answer 1

A:

Why not something like \s\s+ (one whitespace character, then one or more whitespace characters)?

Edit: it strikes me that whatever language/toolkit you're using may not support "splitting" a string using a regex directly. In that case, you may want to implement that functionality, and instead of attempting to match the WORDS in the input, match the SPACES, and use the information from those matches (position, length) to extract the words between the matches. In some languages (.NET, others) this functionality is built-in.
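As a rough illustration of the "match the SPACES" idea, here is a minimal Java sketch. The class name GapSplitter and the method words are made up for this example, and the sample string is borrowed from the other answers in this thread. It matches runs of two or more whitespace characters and uses each match's start and end offsets to cut out the text in between:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
class GapSplitter {
    // The delimiter: one whitespace character followed by one or more others.
    private static final Pattern GAP = Pattern.compile("\\s\\s+");

    // Returns the pieces of text that sit between delimiter matches.
    static List<String> words(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = GAP.matcher(input);
        int wordStart = 0;
        while (m.find()) {
            // Skip empty pieces (e.g. when the input starts with a delimiter).
            if (m.start() > wordStart) {
                words.add(input.substring(wordStart, m.start()));
            }
            wordStart = m.end();
        }
        // Whatever follows the last delimiter is the final word.
        if (wordStart < input.length()) {
            words.add(input.substring(wordStart));
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [AB C, DE, FG HIJ, KLM, NO, P, QRST]
        System.out.println(words("AB C  DE  FG HIJ   KLM    NO  P  QRST"));
    }
}
```

In Java itself you would normally just call the built-in split; the manual loop is only meant for toolkits that lack one.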

Mark 2010-08-10 14:02:17

Answer 2

A:

If you want to match all the words (allowing single spaces inside a word), try \S+(?:[ ]\S+)* (the character class isn't necessary and could just be a literal space character, but I included it for clarity). The pattern requires at least one non-whitespace character and never lets a space be followed by another space.

You didn't mention what language you're using, but here's an example in PHP:

$string = "AB C  DE  FG HIJ   KLM    NO  P  QRST";
$matches = array();
preg_match_all('/\S+(?:[ ]\S+)*/', $string, $matches);
// $matches[0] will contain 'AB C', 'DE', 'FG HIJ', 'KLM', 'NO', 'P', 'QRST'

If the requirement is at most one space per word, change the * at the end to a ?, which gives \S+(?:[ ]\S+)?.
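To see how the two quantifiers differ, here is a small Java sketch. The class SpaceLimitDemo, the helper findAll, and the input "A B C  DE" are all invented for illustration. With * a word may contain any number of single spaces; with ? the match stops after the first one, so the remainder becomes a separate match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original answer.
class SpaceLimitDemo {
    // Collects every match of the given regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "A B C  DE"; // made-up sample input

        // Any number of single spaces inside a word.
        // prints [A B C, DE]
        System.out.println(findAll("\\S+(?:[ ]\\S+)*", input));

        // At most one single space inside a word.
        // prints [A B, C, DE]
        System.out.println(findAll("\\S+(?:[ ]\\S+)?", input));
    }
}
```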

Daniel Vandersluis 2010-08-10 14:03:15

Answer 3

+1 A:

If you know what the delimiter is (\s\s+), you could split instead of match: simply split on two or more whitespace characters.
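In Java, for example, that could look roughly like the sketch below. The class name DoubleSpaceSplit is made up, and trimming the input first is my own assumption about what you want when the string starts or ends with whitespace:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
class DoubleSpaceSplit {
    // Two or more whitespace characters act as the delimiter.
    private static final Pattern DELIMITER = Pattern.compile("\\s{2,}");

    static List<String> split(String input) {
        // trim() drops leading/trailing whitespace so that a delimiter
        // at the very start does not yield an empty first element.
        return Arrays.asList(DELIMITER.split(input.trim()));
    }

    public static void main(String[] args) {
        // prints [AB C, DE, FG HIJ, KLM, NO, P, QRST]
        System.out.println(split("AB C  DE  FG HIJ   KLM    NO  P  QRST"));
    }
}
```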

Regards

rbo

rubber boots 2010-08-10 14:03:27

Answer 4

A:

What about using \s{2,}?

jhericks 2010-08-10 14:04:33

Answer 5

A:

I think this is an even simpler way to match two or more whitespace characters:

\s{2,}

In PHP the split would look like this:

$list = preg_split('/\s{2,}/', $string);

Gedrox 2010-08-10 14:07:54

Answer 6

+1 A:

The simplest solution is to split on \s{2,} to get the "words" you want, but if you insist on scanning for the tokens, then where you previously had \S+, what you want now is \S+(\s\S+)*. It reads exactly as written: \S+, followed by zero or more repetitions of (\s\S+). You can use a non-capturing group for performance, i.e. \S+(?:\s\S+)*. If your flavor supports it, you can even make each repetition possessive for an extra boost, i.e. \S++(?:\s\S++)*+.

Here's a Java snippet to demonstrate:

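    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String text = "AB C  DE  FG HIJ   KLM    NO  P  QRST";
    // Possessive quantifiers (++, *+) never give back what they have matched
    Matcher m = Pattern.compile("\\S++(?:\\s\\S++)*+").matcher(text);
    while (m.find()) {
        System.out.println("[" + m.group() + "]");
    }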

This prints:

[AB C]
[DE]
[FG HIJ]
[KLM]
[NO]
[P]
[QRST]

You can of course substitute just the space character instead of \s if that's your requirement.

References

regular-expressions.info: Character Class, Brackets for Grouping, Repetition, Possessive

polygenelubricants 2010-08-10 15:21:44

FWIW, `.+? +.+? +.+? +.+? +` backtracks catastrophically (see http://www.regular-expressions.info/catastrophic.html). Don't abuse `.` like this.

polygenelubricants 2010-08-10 15:37:42

polygenelubricants, thanks for the response. I always knew that was an abuse of . but I could not find a better solution. Your solution is way better because it specifies the pattern in much more detail. Thanks.

Saad Khakwani 2010-08-11 07:37:15

I was just wondering: if we need to allow 2 spaces in our WORDS and delimit with 3 or more spaces, the regex would be \S+?(\s{1,2}\S+?)*? Also, if I need to delimit with a combination of letters, for example **poly**, what would the regex be? .*?poly has the same backtracking problem. Any comments, polygenelubricants?

Saad Khakwani 2010-08-13 06:59:59

@Saad: to allow up to 2 spaces in the "words", yes, just use `\s{1,2}` instead of simply `\s`. However, the reluctant modifiers in your comment are not necessary, since the pattern is so specific that there's only one way to match anyway. I don't really understand your comment, but you may want to check out http://stackoverflow.com/questions/3075130/difference-between-and-for-regex/3075532#3075532
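A rough Java sketch of the `\s{1,2}` variant, without the reluctant modifiers, in case it helps. The class name TwoSpaceWords and the sample input are invented for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original thread.
class TwoSpaceWords {
    // A word is a run of non-whitespace, optionally continued by
    // 1 or 2 whitespace characters plus more non-whitespace.
    private static final Pattern WORD = Pattern.compile("\\S+(?:\\s{1,2}\\S+)*");

    static List<String> words(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Runs of 3 or more spaces separate the words.
        // prints [AB  C, DE, F G]
        System.out.println(words("AB  C   DE    F G"));
    }
}
```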

polygenelubricants 2010-08-13 07:05:29
