Consider the following JavaScript regular expression matching operation:

"class1 MsoClass2\tmsoclass3\t MSOclass4 msoc5".match(/(^|\s)mso.*?(\s|$)/ig);

I would expect it to return [" MsoClass2\t", "\tmsoclass3\t", " MSOclass4 ", " msoc5"]. Instead it returns [" MsoClass2\t", " MSOclass4 "].

Why?

A: 

Because once it's matched " MsoClass2\t", the matcher is looking at the m in msoclass3, which doesn't match the initial space.
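To make that concrete, here is a small sketch that steps through the question's regex with exec() and records where each match starts and where the next search resumes. The names input, re and trace are only for illustration.

```javascript
// Trace each match of the question's regex and where the next search resumes.
const input = "class1 MsoClass2\tmsoclass3\t MSOclass4 msoc5";
const re = /(^|\s)mso.*?(\s|$)/ig;

const trace = [];
let m;
while ((m = re.exec(input)) !== null) {
  // With the g flag, re.lastIndex is the position where the next exec() starts.
  trace.push({ match: m[0], start: m.index, resumeAt: re.lastIndex });
}

// After the first match the search resumes at the "m" of msoclass3,
// so the tab that preceded it is no longer available to (^|\s).
console.log(JSON.stringify(trace));
console.log(JSON.stringify(input[trace[0].resumeAt]));
```

The first match ends just past the tab, so the search resumes on a letter, and the next place where whitespace is directly followed by "mso" is the space before MSOclass4.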

Simon Nickerson
A: 

This is because you are using ^ or \s (whitespace) for the start of the match, while the string has no whitespace left to match before msoclass3. To get the results you want, use the following inside match():

/mso.*?(\s|$)/ig
Crimson
A: 

Hi,

At first I was not sure you can use something like (^|\s) and (\s|$). Maybe you can, but I have to think to understand the regex, and it's never good when someone has to think to understand a regex: those are often far too complicated :-(


If you want to match words that begin with "mso", be it upper or lowercase, I'd probably use something like this:

"class1 MsoClass2\tmsoclass3\t MSOclass4 msoc5".match(/\s?(mso[^\s]*)\s?/ig);

Which gets you :

[" MsoClass2\t", "msoclass3\t", " MSOclass4 ", "msoc5"]

Which is almost what you asked for (there are a couple of whitespace differences).

Or, even simpler :

"class1 MsoClass2\tmsoclass3\t MSOclass4 msoc5".match(/(mso[^\s]*)/ig);

Which gets you :

["MsoClass2", "msoclass3", "MSOclass4", "msoc5"]

Without any whitespace.


Easier to read and understand, too ;-)

Pascal MARTIN
(^|\s) and (\s|$) are legit
Nerdling
@Nerdling : thanks. (That's what I meant by "having to think" ^^ )
Pascal MARTIN
+2  A: 

The tab character before msoclass3 is already consumed by the first match " MsoClass2\t". Maybe you want to use a non-consuming look-ahead assertion instead:

/(^|\s)mso[^\s]*(?=\s|$)/
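As a quick check, this sketch applies that pattern to the question's string, using the i and g flags from the question. The trim() step and the names input, matches and names are assumptions added for illustration, in case only the class names are wanted.

```javascript
// Apply the look-ahead pattern with the question's i and g flags.
const input = "class1 MsoClass2\tmsoclass3\t MSOclass4 msoc5";

// The look-ahead (?=\s|$) checks for trailing whitespace without consuming it,
// so that whitespace can still start the next match. "|| []" covers the
// no-match case, where match() returns null.
const matches = input.match(/(^|\s)mso[^\s]*(?=\s|$)/ig) || [];
console.log(JSON.stringify(matches));
// [" MsoClass2","\tmsoclass3"," MSOclass4"," msoc5"]

// Each match still carries its leading whitespace; trim() drops it.
const names = matches.map((s) => s.trim());
console.log(JSON.stringify(names));
// ["MsoClass2","msoclass3","MSOclass4","msoc5"]
```

All four classes are found because only the leading separator is consumed, and the trailing one stays available for the following match.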
Gumbo
+2  A: 

Because the first match consumes the tab character, there is no whitespace character left before msoclass3. The same happens with the space after " MSOclass4 ": it is consumed, so nothing is left to match before msoc5.

Perhaps you want to match word boundaries instead of the separating characters. This code:

"class1 MsoClass2\tmsoclass3\t MSOclass4 msoc5".match(/\bmso.*?\b/ig)

will give you this result:

["MsoClass2","msoclass3","MSOclass4","msoc5"]
Guffa
Didn't know about the \b wildcard; very elegant!
Tim Molendijk