tags:

views:

51

answers:

6

Hello, I want a regex to match words that are delimited by double or more space characters, e.g.

ABC  DE  FGHIJ   KLM    NO  P  QRST

Notice double or more spaces between the alphabets. Writing regex for such a problem is easy as I only need the first 4 words, as we may search for a word by using \S+ or \S+?

However, for my problem, only 1 white space CAN occur in a word, for example

AB C  DE  FG HIJ   KLM    NO  P  QRST

Here AB C is a word and FG HIJ is a word as well. In short we want to isolate characters that are spearated by double or more white spaces, I tried using this regex,

.+?  +.+?  +.+?  +.+?  +

it matches very swiftly, but it takes too much time for strings it doesn't match. (4 matches are given as an example here, in practice I need to match more).

I am in a need of a better regex to accomplish this, so that all the backtracking can be avoided. [^ ]* is a regex which will match uptill a space is encountered. Can't we specify a negated character set where we continue matching in case of a single space and break when 2 are encountered? I've tried using positive lookahead but failed miserably.

I would really appreciate your help. Thanks in advance.

Saad

A: 

Why not something like \s\s+ (one whitespace character, then one or more whitespace characters)?

Edit: it strikes me that whatever language/toolkit you're using may not support "splitting" a string using a regex directly. In that case, you may want to implement that functionality, and instead of attempting to match the WORDS in the input, match the SPACES, and use the information from those matches (position, length) to extract the words between the matches. In some languages (.NET, others) this functionality is built-in.

Mark
A: 

If you want to match all the words (allowing one space in a row), try \S+(?:[ ]\S+)* (the character class isn't necessary and can just be a space character, but I included it for clarity). It specifies that at least one non-whitespace character is required, and a space cannot be followed by another one.

You didn't mention what language you're using, but here's an example in PHP:

$string = "AB C  DE  FG HIJ   KLM    NO  P  QRST";
$matches = array();
preg_match_all('/\S+(?:[ ]\S+)*/', $string, $matches);
// $matches will contain 'AB C', 'DE', 'FG HIJ', 'KLM', 'NO', 'P', 'QRST'

If the requirements are at most one space per word, just change the * at the end to a ?: \S+(?:[ ]\S+)?.

Daniel Vandersluis
+1  A: 

if you know what the delimiter is (\s\s+), you could split instead of match. Simply split on two or more spaces.

Regards

rbo

rubber boots
A: 

What about using \s{2,}

jhericks
A: 

I think this is even more simple to match 2 or more whitespaces:

\s{2,}

In PHP the split would look like this

$list = preg_split('/\s{2,}/', $string);

Gedrox
+1  A: 

The simplest solution is to split on \s{2,} to get the "words" you want, but if you insist on scanning for the tokens, then where as before you have \S+, what you have now is \S+(\s\S+)*. That's exactly what it says: \S+, followed by zero or more (\s\S+). You can use non-capturing group for performance, i.e. \S+(?:\s\S+)*. You can even make each repetition possessive if your flavor supports it for extra boost, i.e. \S++(?:\s\S++)*+.

Here's a Java snippet to demonstrate:

    String text = "AB C  DE  FG HIJ   KLM    NO  P  QRST";
    Matcher m = Pattern.compile("\\S++(?:\\s\\S++)*+").matcher(text);
    while (m.find()) {
        System.out.println("[" + m.group() + "]");
    }

This prints:

[AB C]
[DE]
[FG HIJ]
[KLM]
[NO]
[P]
[QRST]

You can of course substitute just the space character instead of \s if that's your requirement.

References

polygenelubricants
FWIW, `.+? +.+? +.+? +.+? +` backtracks catastrophically, http://www.regular-expressions.info/catastrophic.html ; Don't abuse `.` like this.
polygenelubricants
polygenelubricants, thanks for the response. I always knew that was an abuse of . but I was not finding a better soln. Your soln is way better because it specifies the pattern with much more detail. Thanks.
Saad Khakwani
I was just wondering if we need to allow 2 spaces in our WORDS and delimit with 3 or more spaces the regex would be \S+?(\s{1,2}\S+?)*?Also, if I need to delimit with a combination of letters for example **poly**, what would be the regex .*?poly offers same backtracking. Any comments polygenelubricants.
Saad Khakwani
@Saad: to allow up to 2 spaces in the "words", yes, just use `\s{1,2}` instead of simply `\s`. However, all the reluctant modifier in your comment is not necessary, since the pattern is so specific that there's only one way to match anyway. I don't really understand your comment, but you may want to check out http://stackoverflow.com/questions/3075130/difference-between-and-for-regex/3075532#3075532
polygenelubricants