I'm trying to make a dynamic regex that matches a person's name. It works without problems on most names, until I ran into accented characters at the end of the name.

Example: Some Fancy Namé

The regex I've used so far is:

/\b(Fancy Namé|Namé)\b/i

Used like this:

"Goal: Some Fancy Namé. Awesome.".replace(/\b(Fancy Namé|Namé)\b/i, '<a href="#">$1</a>');

This simply won't match. If I replace the é with an e, it matches just fine. If I try to match a name such as "Some Fancy Naméa", it works just fine. If I remove the last word boundary anchor, it works just fine.

Why doesn't the word boundary anchor work here? Any suggestions on how I would get around this problem?

I have considered using something like this, but I'm not sure what the performance penalties would be like:

"Some fancy namé. Allow me to ellaborate.".replace(/([\s.,!?])(fancy namé|namé)([\s.,!?]|$)/g, '$1<a href="#">$2</a>$3')

Suggestions? Ideas?

A: 

Maybe try writing the é as an octal or hex escape (such as \xNN) in your regex.

The end of this reference for Javascript regular expressions might help you out.

As to what actual octal/hex values é is associated with, I'm not sure.
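For what it's worth, é is U+00E9, so the hex escape would be \xE9 (or \u00E9). A quick sketch, with made-up variable names, suggests that the escape matches the same character but does not change how \b behaves:

```javascript
// The escaped and literal forms describe the same character (U+00E9).
var sample = "Some Fancy Nam\u00E9.";

// Without the trailing \b the escaped pattern finds the name.
var escapedMatches = /\bNam\xE9/i.test(sample);

// With the trailing \b it still fails, exactly like the literal é version.
var escapedWithBoundary = /\bNam\xE9\b/i.test(sample);

console.log(escapedMatches, escapedWithBoundary);
```

So escaping is a way to keep the source file ASCII-only, not a fix for the boundary problem.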

Onion-Knight
+3  A: 

JavaScript's regex implementation is not Unicode-aware. It only knows the ‘word characters’ in standard low-byte ASCII, which does not include é or any other accented or non-English letters.

Because é is not a word character to JS, é followed by a space can never be considered a word boundary. (It would match \b if used in the middle of a word, like Namés.)
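A small sketch of the cases from the question makes this visible (the variable names are only for illustration):

```javascript
var accented = /\bNamé\b/i;
var plain = /\bName\b/i;

var results = {
  // é then '.': both are non-word characters to JS, so no boundary.
  accentedBeforePeriod: accented.test("Some Fancy Namé."),
  // é then 'a': non-word followed by word character, so \b matches.
  accentedMidWord: accented.test("Some Fancy Naméa"),
  // e then '.': word followed by non-word character, so \b matches.
  plainBeforePeriod: plain.test("Some Fancy Name."),
  // Dropping the trailing \b avoids the problem (and the boundary check).
  noTrailingAnchor: /\bNamé/i.test("Some Fancy Namé.")
};

console.log(results);
```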

/([\s.,!?])(fancy namé|namé)([\s.,!?]|$)/

Yeah, that would be the usual workaround for JS (though probably with more punctuation characters). For other languages you'd generally use lookahead/lookbehind to avoid matching the pre and post boundary characters, but these are poorly supported/buggy in JS so best avoided.
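As a rough sketch of that workaround (the helper name linkNames is made up, and the delimiter list is just the one from the question plus start-of-string), the delimiters are captured and put back around the link:

```javascript
// Hypothetical helper: treats whitespace, basic punctuation and the
// start/end of the string as name delimiters instead of relying on \b.
function linkNames(text) {
  var pattern = /(^|[\s.,!?])(fancy namé|namé)([\s.,!?]|$)/gi;
  // $1 and $3 re-emit the delimiters that the pattern consumed.
  return text.replace(pattern, '$1<a href="#">$2</a>$3');
}

console.log(linkNames("Goal: Some Fancy Namé. Awesome."));
```

One limitation to keep in mind: the trailing delimiter is consumed by the match, so two names separated by a single space ("Namé Namé") would only get the first one linked.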

bobince
That explains it. I ended up with the following: /(\W|^)(fancy namé|namé)(\W|$)/ig, which seems to fit my needs :-)
Rexxars
+2  A: 
KennyTM
+1  A: 

String.replace() accepts a callback function as its second parameter. (I don't know why so many JS tutorials omit this useful feature.) Thus, we can write our own test for word boundaries.

The solution proposed elsewhere, with regexp /(\W|^)(fancy namé|namé)(\W|$)/ig, gives false positives in cases of text such as 'naméé'.
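To illustrate the false positive: \W is ASCII-only just like \b, so a trailing é counts as a "non-word" delimiter. A minimal check (square brackets stand in for the link markup, and the variable name is only for illustration):

```javascript
var loose = /(\W|^)(fancy namé|namé)(\W|$)/ig;

// The second é is matched by \W, so "namé" is wrongly treated as a whole word.
var falsePositive = "naméé".replace(loose, '$1[$2]$3');

console.log(falsePositive);
```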

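String.prototype.isWordCharAt = function(i) {
    // Rough approximation of "letter": A-Z, a-z, and code points
    // U+00C0..U+1FFF, which covers accented Latin and most European scripts
    // (it also lets through a few symbols in that range, such as × and ÷).
    return (this.charAt(i) >= 'A' && this.charAt(i) <= 'Z')
        || (this.charAt(i) >= 'a' && this.charAt(i) <= 'z')
        || (this.charCodeAt(i) >= 0xC0 && this.charCodeAt(i) < 0x2000)
    ;
};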

"Namé. Goal: Some Fancy Namé. Namé. Nénamé. Namée. Nénamée. Namé"
.replace(/(Namé|Fancy Namé)/ig, function(
match, part1, /* part2, part3, ... */ offset, fullText) {
  // Keep in mind that the number of arguments changes
  // if the number of capturing parentheses in the regexp changes.
  // We could use the 'arguments' pseudo-array instead.
  var len1 = part1.length;
  var leftWordBoundary;
  var rightWordBoundary;

  if (offset === 0) {
    leftWordBoundary = fullText.isWordCharAt(offset);
  }
  else {
    leftWordBoundary = (fullText.isWordCharAt(offset - 1)
      != fullText.isWordCharAt(offset));
  }

  if (offset + len1 == fullText.length) {
    rightWordBoundary = fullText.isWordCharAt(offset + len1 - 1);
  }
  else {
    rightWordBoundary = (fullText.isWordCharAt(offset + len1 - 1)
      != fullText.isWordCharAt(offset + len1));
  }

  if (leftWordBoundary && rightWordBoundary) {
    return '<a href="#">' + part1 + '</a>';
  }
  else {
    return part1;
  }
});
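The same idea can be sketched without extending String.prototype. This is a simplified variant, not the code above: the function names are made up, and it assumes every name in the pattern starts and ends with a letter, so the boundary test reduces to "the neighbouring character is not a word character".

```javascript
// Same range test as above, as a standalone function (hypothetical name).
function isWordChar(ch) {
  if (!ch) return false; // '' means we are past either end of the string
  var code = ch.charCodeAt(0);
  return (ch >= 'A' && ch <= 'Z')
      || (ch >= 'a' && ch <= 'z')
      || (code >= 0xC0 && code < 0x2000);
}

function linkWholeNames(text) {
  // No capturing groups, so the callback receives (match, offset, fullText).
  return text.replace(/Fancy Namé|Namé/gi, function (match, offset, fullText) {
    var before = fullText.charAt(offset - 1);            // '' at offset 0
    var after = fullText.charAt(offset + match.length);  // '' at the end
    if (isWordChar(before) || isWordChar(after)) {
      return match; // part of a longer word, leave untouched
    }
    return '<a href="#">' + match + '</a>';
  });
}

console.log(linkWholeNames("Namé. Nénamé. Namée. Namé"));
```

Only the first and last "Namé" end up linked here; "Nénamé" and "Namée" are skipped because a letter sits directly next to the match.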
Josef Svoboda