Given the following Regular Expression:

\b(MyString|MyString-Dash)\b

And the text:

AString MyString MyString-Dash

Running a match against the text never finds a match for the second thing (MyString-Dash) because the '-' (dash) character isn't a word boundary character. The following JavaScript always outputs "MyString,MyString" to the "matches" div, whereas I would like to find MyString and MyString-Dash as distinct matches. How can I define a pattern that will match both MyString and MyString-Dash?

<html>
<body>
    <h1>Content</h1>
    <div id="content">
        AString
        MyString
        MyString-Dash
    </div>
    <br>
    <h1>Matches (expecting MyString,MyString-Dash)</h1>
    <div id="matches"></div>
</body>
<script>
    var content = document.getElementById('content');
    var matchesDiv = document.getElementById('matches');
    var pattern = '\\b(MyString|MyString-Dash)\\b';
    var matches = content.innerHTML.match(pattern);
    matchesDiv.innerHTML = matches;
</script>
</html>
+1  A: 

Swap the order of your matching so that the longest possible is first:

content.innerHTML.match(/\b(MyString-Dash|MyString)\b/)

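I believe the alternatives in a group are tried from left to right, and the first one that lets the overall match succeed wins, so MyString shadows MyString-Dash when it is listed first. I just tested this in Firebug and it works.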

I would also change that pattern var to a regular expression literal, from '\\b(MyString-Dash|MyString)\\b' to /\b(MyString-Dash|MyString)\b/g

You want the /g in there because that will make the regular expression return all matches, rather than just the first one.
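As a minimal sketch of the suggestion above, assuming the text from the question is held in a plain string rather than read from the page (the variable names here are illustrative and not from the question):

```javascript
// Text from the question, held in a plain string instead of read from the DOM.
const text = 'AString MyString MyString-Dash';

// Longer alternative first, so it is tried before its own prefix.
const longestFirst = /\b(MyString-Dash|MyString)\b/g;

// With the g flag, match() returns every full match and omits capture groups.
const found = text.match(longestFirst);
console.log(found.join(',')); // MyString,MyString-Dash
```

Without the g flag, match() returns only the first match followed by its capture groups, which is why the original page showed "MyString,MyString".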

Bryan Ross
A: 

There are a couple of problems with your assumptions.

Running a match against the text never finds a match for the second thing (MyString-Dash) because the '-' (dash) character isn't a word boundary character.

There's no such thing as a word boundary character. A word boundary is a zero-width position between a character that matches \w and one that doesn't (or the start or end of the string next to a \w character). - does not match \w, so there is a word boundary on either side of it, but that won't break your match: the - in your regex is a literal dash, and the \b's are far away from it.
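A small sketch of where the boundaries fall, using only the string from the question (the variable names are made up for illustration):

```javascript
const sample = 'MyString-Dash';

// Splitting on \b cuts the string at each interior word boundary,
// which shows one boundary on each side of the dash.
const pieces = sample.split(/\b/);
console.log(pieces); // [ 'MyString', '-', 'Dash' ]

// The dash in the pattern is matched as an ordinary literal character,
// so the \b anchors at the two ends do not prevent a match.
const wholeMatch = /\bMyString-Dash\b/.test(sample);
console.log(wholeMatch); // true
```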

Second, regexen will always return the leftmost match they can find in the string. As long as that first occurrence matches, that is what you will keep getting back. When you ask for a match, you're asking for the first match. That's the design. If you don't want it to match MyString, don't ask for it.

Third, most regex engines prioritize 'completing a match' over length of a match. Thus, 'MyString', if it matches, will always be the first thing it returns. You'll have to wait until Perl 6 grammars for a regex engine that prioritizes length. :)
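To illustrate the point about completing a match, here is a short sketch that runs the question's original pattern against the longer string on its own (the variable names are illustrative):

```javascript
// The question's original pattern, shorter alternative first.
const shortestFirst = /\b(MyString|MyString-Dash)\b/;

// The engine tries MyString first. The position between "g" and "-" is a
// word boundary, so the trailing \b succeeds and the match is complete
// before the longer alternative is ever tried.
const result = 'MyString-Dash'.match(shortestFirst);
console.log(result[0]);    // MyString
console.log(result.index); // 0
```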

The only way for you to really do this is with two checks, one for the longer one, first, and then one for the shorter one. It will always match the first thing it finds that works. If you have a priority other than that, it's up to you to code it in as separate checks.
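The two-check approach could be sketched roughly like this. The function name firstPreferringLonger is a hypothetical helper invented for this example, not something from the question:

```javascript
// Hypothetical helper: check for the longer string first, and only fall
// back to the shorter one if the longer one is not found anywhere.
function firstPreferringLonger(text) {
  const longer = /\bMyString-Dash\b/;
  const shorter = /\bMyString\b/;
  const hit = text.match(longer) || text.match(shorter);
  return hit ? hit[0] : null;
}

console.log(firstPreferringLonger('AString MyString MyString-Dash')); // MyString-Dash
console.log(firstPreferringLonger('AString MyString'));               // MyString
console.log(firstPreferringLonger('AString'));                        // null
```

Note that this encodes a priority by length rather than by position: the longer string wins even when the shorter one appears earlier in the text.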

Robert P
Cool, good information to know. I had read here (http://www.regular-expressions.info/wordboundaries.html) that \b and \w were basically the equivalent of [a-zA-Z0-9_]
Mike Hugo
That's not at all what it says on the page. :) "The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary"."
Robert P
And: "Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters"."
Robert P