+2  Q: 

Javascript regex

I was trying to do a regex for someone else when I ran into this problem. The requirement was that the regex should return results from a set of strings that has, let's say, "apple" in it. For example, consider the following strings:

"I have an apple" "You have two Apples" "I give you one more orange"

The result set should have the first two strings.

The regex(es) I tried are:

/[aA]pple/ and /[^a-zA-Z0-9][aA]pple/

The problem with the first one is that words like "aapple", "bapple", etc. (OK, they are meaningless, but still...) test positive with it. The problem with the second one is that when a string actually starts with the word, as in "Apples and oranges", it tests negative. Can someone explain why the second regex behaves this way and what the correct regex would be?

+8  A: 
/(^.*?\bapples?\b.*$)/i

Edit: The above will match the entire string containing the word "apples", which I thought is what you were asking for. If you are just trying to see if the string contains the word, the following will work.

/\bapples?\b/i

The regex(es) I tried are:

/[aA]pple/ and /[^a-zA-Z0-9][aA]pple/

The first one just checks for the characters a-p-p-l-e in order (with the first letter in either case), regardless of the context they appear in. \b, the word-boundary assertion, matches any position where a word character and a non-word character meet, in either order (\W\w or \w\W), as well as the start or end of the string when a word character sits there.

The second one requires one extra character immediately before a-p-p-l-e, and that character must not be a letter or digit. It is otherwise the same as the first, so it fails when the string starts with "apple", because there is nothing in front of the word to match.
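A quick check of both attempts, as a sketch. The patterns and sample strings come from the question; the variable names are only for illustration.

```javascript
// The two patterns from the question.
const loose = /[aA]pple/;
const prefixed = /[^a-zA-Z0-9][aA]pple/;

// The first pattern finds the letters anywhere, even inside another word.
const looseHitsBapple = loose.test("bapple"); // true

// The second needs one non-alphanumeric character right before the "a"/"A".
const prefixedMidString = prefixed.test("You have two Apples"); // true: the space qualifies
const prefixedAtStart = prefixed.test("Apples and oranges");    // false: nothing precedes "A"
const prefixedBapple = prefixed.test("bapple");                 // false: "b" is alphanumeric

console.log(looseHitsBapple, prefixedMidString, prefixedAtStart, prefixedBapple);
```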

The one I answered with works as follows. From the beginning of the string, it matches any characters (if they exist) non-greedily until it reaches a word boundary followed by "apple". If the string starts with apple, the beginning of the string counts as a word boundary, so it still matches. It then matches the letters a-p-p-l-e, and s if it exists, followed by another word boundary. It then matches all characters to the end of the string. The /i at the end means it's case-insensitive, so 'Apple', 'APPLE', and 'apple' are all valid.
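To make the difference between the two patterns concrete, here is a minimal sketch using strings from the question. The variable names are made up for the example.

```javascript
// Whole-string pattern and word-only pattern from this answer.
const whole = /(^.*?\bapples?\b.*$)/i;
const word = /\bapples?\b/i;

const input = "You have two Apples";

// The whole-string pattern captures the entire input.
const wholeMatch = input.match(whole);
console.log(wholeMatch[0]); // "You have two Apples"

// The word-only pattern captures just the word and reports where it starts.
const wordMatch = input.match(word);
console.log(wordMatch[0], wordMatch.index); // "Apples" 13

// A string that begins with the word still matches,
// because \b succeeds at the start of the string.
const startMatch = "Apples and oranges".match(whole);
console.log(startMatch[0]); // "Apples and oranges"

// A string without the word does not match at all.
const noMatch = "I give you one more orange".match(whole);
console.log(noMatch); // null
```

Both patterns give the same true/false answer with test(); they only differ in what text the match contains.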

If you have the time, I would highly recommend walking through the tutorial at http://regular-expressions.info. It goes in depth on how regular expression engines match different expressions, and it helped me a ton.

tj111
beat me to it :)
annakata
It would fail on appleseed, as in Johnny. I doubt it is a big deal though.
gpojd
Please don’t use the “^.*?” and “.*?$”!
Gumbo
@Gumbo, it's not necessary, but he is trying to match the entire string, not just the word apple. If he wants to capture the whole string, he could just slap some parentheses on there.
tj111
Thanks a lot for the detailed answer. But searching for other characters before and after apple(s) is superfluous in this scenario and produces the exact same result as when it is left out. Just `/\bapples?\b/i` does the job. Correct me if I'm missing something, as I'm very new to regexes.
Mussnoon
Thanks again for taking the time to give such a detailed answer. I don't do "plzsendtehcodez" questions, so usually look for explanations alongside so I know what's happening and why. And your answer essentially answered both my questions.
Mussnoon
A: 
/\bapple/i

\b is a word boundary.

To explain why your attempts do not work, the first one does not check that it is the beginning of the word, so it can have something before it. The second regex you gave says that something must be before the word "apple", but it can't be alphanumeric.
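For what it's worth, a small sketch of what this pattern accepts and rejects. The second pattern, /\bapples?\b/i, is borrowed from the other answers for comparison, and the test strings are just examples.

```javascript
// Leading boundary only (this answer) vs. boundaries on both sides.
const leadingOnly = /\bapple/i;
const bothSides = /\bapples?\b/i;

// With only a leading \b, anything that merely starts with "apple" matches.
const leadingAppleseed = leadingOnly.test("Johnny Appleseed"); // true
// With a trailing \b as well, "Appleseed" is rejected.
const bothAppleseed = bothSides.test("Johnny Appleseed");      // false

// The leading \b is what rejects words that only end in "apple".
const leadingBapple = leadingOnly.test("bapple");       // false
const leadingPineapple = leadingOnly.test("pineapple"); // false

console.log(leadingAppleseed, bothAppleseed, leadingBapple, leadingPineapple);
```

Whether "Appleseed" should count is a judgment call that depends on the requirement.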

gpojd
A: 

Your second regex requires a nonalphanumeric character before the first a in apple. "apple" doesn't satisfy this. As others note, "\b" matches not a character, but a word boundary position.

+3  A: 

To build on @tj111, the reason your second regex fails is that [^a-zA-Z0-9] requires that a character matches; that is, there must be some character in that position, and it must not be in the set [a-zA-Z0-9]. Markers like \b are called "zero-width assertions". \b, in particular, matches at a boundary between a word character and a non-word character, or at the beginning or end of a string next to a word character. Because it does not consume any character, its "width" is zero.

In sum, [^a-zA-Z0-9] requires that a character be present and that it not be a letter or digit, while \b requires only that a boundary be present.
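One way to see the "zero-width" part is to look at what each pattern actually consumes. This is only a sketch; the sample strings and variable names are invented for the illustration.

```javascript
// A character class consumes one character; \b consumes none.
const withClass = /[^a-zA-Z0-9]apple/;
const withBoundary = /\bapple/;

// Mid-string: the class swallows the preceding space, \b does not.
const classMatch = withClass.exec("an apple");
const boundaryMatch = withBoundary.exec("an apple");
console.log(JSON.stringify(classMatch[0]), classMatch[0].length);       // " apple" 6
console.log(JSON.stringify(boundaryMatch[0]), boundaryMatch[0].length); // "apple" 5

// At the start of the string there is no character for the class to
// consume, so it fails, while \b still succeeds at position 0.
const classAtStart = withClass.exec("apple pie");
const boundaryAtStart = withBoundary.exec("apple pie");
console.log(classAtStart);          // null
console.log(boundaryAtStart.index); // 0
```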

Edit: @tj111 has added most of this to his response. I'm in too late, again :)

kyle
still worth a +1 for discussing "zero-width" assertions.
tj111
gotta love regular-expressions.info :)
kyle
+1  A: 

This works for "apple" and "apples" in any letter case:

var strings = ["I have an apple", "You have two Apples", "I give you one more orange"];
var result = [];
var pattern = /\bapples?\b/i;
for (var i=0; i<strings.length; i++) {
    if (pattern.test(strings[i])) {
        result.push(strings[i]);
    }
}
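
As an aside, the same selection can probably be written more compactly with Array.prototype.filter. This is an alternative sketch, not a change to the loop above, and it assumes an environment with arrow functions and const.

```javascript
const strings = [
  "I have an apple",
  "You have two Apples",
  "I give you one more orange",
];

// No "g" flag here, so test() keeps no state between calls.
const pattern = /\bapples?\b/i;

// Keep only the strings in which the pattern is found.
const result = strings.filter((s) => pattern.test(s));

console.log(result); // ["I have an apple", "You have two Apples"]
```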
Gumbo