ansaurus

Question

Why does \w match only English words in JavaScript regex?

Answer 1

A:

Perhaps \S (non-whitespace).

chaos 2008-12-29 14:21:14

Answer 2

+11 A:

Because \w only matches the ASCII characters 48-57 ('0'-'9'), 65-90 ('A'-'Z'), 95 ('_') and 97-122 ('a'-'z'). Hebrew characters and other non-English letters (for example, umlaut-o or tilde-n) are outside those ranges.
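One way to check which ASCII codes \w covers is to test every code from 0 to 127. This is a minimal sketch, assuming a standard JavaScript engine; the helper name asciiWordCodes is made up for illustration.

```javascript
// Hypothetical helper: collect every ASCII code (0-127) whose character matches \w.
function asciiWordCodes() {
  const codes = [];
  for (let code = 0; code < 128; code++) {
    if (/\w/.test(String.fromCharCode(code))) {
      codes.push(code);
    }
  }
  return codes;
}

const wordCodes = asciiWordCodes();
console.log(wordCodes.length);        // 63: 10 digits + 26 upper + 26 lower + '_'
console.log(/\w/.test("\u00F6"));     // false: umlaut-o (U+00F6)
console.log(/\w/.test("\u05D0"));     // false: Hebrew alef (U+05D0)
```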

Instead of matching non-English characters (there are many of them, spread over many different Unicode ranges), you might be better off looking for the characters that delimit your words: spaces, quotation marks, and other punctuation.
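A rough sketch of that delimiter-based approach: match runs of anything that is not a delimiter. The delimiter set here (whitespace, '.', '/' and ':') and the helper name segments are assumptions for illustration, not a complete list.

```javascript
// Assumed delimiter set: whitespace, '.', '/' and ':'. Adjust for your input.
const SEGMENT = /[^\s./:]+/g;

// Hypothetical helper: return every run of non-delimiter characters.
function segments(text) {
  return text.match(SEGMENT) || [];
}

// The middle path segment is the Hebrew word "shalom", written with escapes.
const sampleUrl = "example.com/\u05E9\u05DC\u05D5\u05DD/page";
console.log(segments(sampleUrl)); // four segments: "example", "com", the Hebrew word, "page"
```

Because the class is negated, it never needs to know which scripts appear between the delimiters.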

David 2008-12-29 14:22:06

Thanks, for the inner parts of the url I ended up matching everything except space, '.' and '/'. Anything else I might be missing?

Doron Yaacoby 2008-12-29 15:18:57

Perhaps a colon, ':', which can separate the host in a URL from a port number.

David 2008-12-29 20:18:10

Answer 3

+1 A:

Have a look at http://www.regular-expressions.info/refunicode.html.

It looks like there is no \w equivalent for Unicode, but you can match individual Unicode letters, so you can build one yourself.

Gamecat 2008-12-29 14:22:33

This page has a more thorough explanation and listing of character patterns: http://www.regular-expressions.info/unicode.html

enobrev 2008-12-29 16:42:16

Answer 4

+1 A:

Check out this SO question about JavaScript and Unicode. It looks like Jan Goyvaerts's answer there provides some hope for you.

Edit: But then it seems browsers don't support \p ... anyway. That question should still contain useful info.
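For readers on newer engines: ECMAScript 2018 added Unicode property escapes, so \p{...} does work when the regex carries the u flag. The sketch below assumes such an engine (current browsers and Node.js); the variable names are illustrative only.

```javascript
// Requires an engine with ES2018 Unicode property escapes (note the `u` flag).
const unicodeWord = /^[\p{L}\p{N}_]+$/u;      // letters, numbers, underscore in any script
const hebrewWord = /^\p{Script=Hebrew}+$/u;   // only characters in the Hebrew script

const shalom = "\u05E9\u05DC\u05D5\u05DD";    // Hebrew "shalom"
console.log(unicodeWord.test(shalom));        // true
console.log(unicodeWord.test("ni\u00F1o"));   // true: tilde-n is a letter
console.log(unicodeWord.test("a b"));         // false: space is not a word character
console.log(hebrewWord.test(shalom));         // true
console.log(hebrewWord.test("abc"));          // false
```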

PEZ 2008-12-29 14:22:51

The answer you linked to was wrong. I've updated it.

Jan Goyvaerts 2008-12-30 13:37:09

Too bad. \p would have been just what the doctor ordered.

PEZ 2008-12-30 19:05:14

Answer 5

A:

If you're the one generating URLs with non-English letters in them, you may want to reconsider.

If I'm interpreting the W3C correctly, URLs may only contain word characters from the Latin alphabet.

Triptych 2008-12-29 15:36:38

Sadly I can't control the URL creation, and the URLs will almost always contain Hebrew characters.

Doron Yaacoby 2008-12-30 15:40:14

Answer 6

+4 A:

The ECMA 262 v3 standard, which defines the programming language commonly known as JavaScript, stipulates that \w should be equivalent to [a-zA-Z0-9_] and that \d should be equivalent to [0-9]. \s on the other hand matches both ASCII and Unicode whitespace, according to the standard.
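These rules are easy to confirm in a console. A minimal sketch, assuming a standard engine and regexes without the i or u flags:

```javascript
console.log(/^\w$/.test("_"));        // true: underscore is part of \w
console.log(/^\w$/.test("\u00E9"));   // false: e-acute is not ASCII
console.log(/^\d$/.test("7"));        // true
console.log(/^\d$/.test("\u0663"));   // false: Arabic-Indic digit three
console.log(/^\s$/.test(" "));        // true
console.log(/^\s$/.test("\u00A0"));   // true: no-break space is Unicode whitespace
console.log(/^\s$/.test("\u2003"));   // true: em space
```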

JavaScript does not support the \p syntax for matching Unicode things either, so there isn't a good way to do this. You could match all Hebrew characters with:

[\u0590-\u05FF]

This simply matches any code point in the Hebrew block.

You can match any ASCII word character or any Hebrew character with:

[\w\u0590-\u05FF]
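A short usage sketch for both character classes, anchored so that the whole string must match. The variable names are illustrative. Note that the Hebrew block also contains points and punctuation (for example maqaf, U+05BE), so it is broader than letters alone.

```javascript
const hebrewOnly = /^[\u0590-\u05FF]+$/;      // any code point in the Hebrew block
const wordOrHebrew = /^[\w\u0590-\u05FF]+$/;  // ASCII word characters or Hebrew block

const shalom = "\u05E9\u05DC\u05D5\u05DD";    // Hebrew "shalom"
console.log(hebrewOnly.test(shalom));                   // true
console.log(hebrewOnly.test("abc"));                    // false
console.log(hebrewOnly.test("\u05BE"));                 // true: maqaf is in the block
console.log(wordOrHebrew.test("abc_" + shalom + "123")); // true
console.log(wordOrHebrew.test("a b"));                  // false: space is in neither set
```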

Jan Goyvaerts 2008-12-30 13:33:53

Answer 7

A:

Note that URIs (a superset of URLs) are specified by the IETF in RFC 3986 to only allow US-ASCII characters. Normally all other characters should be represented in percent-encoded notation:

In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI. // URI: Generic Syntax

This is what generally happens when you open a URL with non-ASCII characters in a browser: the characters get translated into %AB notation, which is itself US-ASCII.

If it is possible to influence the way the material is created, the best option would be to pass URLs through a urlencode()-type function during their creation.
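In JavaScript, the built-in encodeURIComponent plays that role; it percent-encodes the UTF-8 bytes of each non-ASCII character. A small sketch with an illustrative Hebrew path segment:

```javascript
const raw = "\u05E9\u05DC\u05D5\u05DD";   // Hebrew "shalom"
const encoded = encodeURIComponent(raw);

console.log(encoded);                              // "%D7%A9%D7%9C%D7%95%D7%9D"
console.log(/^[\w%]+$/.test(encoded));             // true: plain \w plus '%' is now enough
console.log(decodeURIComponent(encoded) === raw);  // true: the encoding round-trips
```

Once a segment is encoded this way, an ASCII-only pattern can match it without knowing anything about Hebrew.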

Gnudiff 2008-12-30 14:50:20

Answer 8

+2 A:

I think you are looking for this regex:

^[אבגדהוזחטיכלמנסעפצקרשתץףןםךa-zA-Z0-9\s\.\-_\\\/]+$

lani 2010-09-16 06:33:19

Welcome to Stack Overflow. I never tried, but `א-ת` may work as well, even including the final letters - http://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet .
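The range can be checked quickly: alef is U+05D0 and tav is U+05EA, and the five final forms all fall between them. The sketch below writes the range with escapes to avoid right-to-left editing issues; the variable names are illustrative.

```javascript
const hebrewLetters = /^[\u05D0-\u05EA]+$/;        // alef (U+05D0) through tav (U+05EA)
const finals = "\u05DA\u05DD\u05DF\u05E3\u05E5";   // final kaf, mem, nun, pe, tsadi

console.log(hebrewLetters.test(finals));  // true: final forms are inside the range
console.log(hebrewLetters.test("abc"));   // false

// The same range combined with the other characters from the regex above.
const pattern = /^[\u05D0-\u05EAa-zA-Z0-9\s.\-_\\\/]+$/;
console.log(pattern.test("dir/file-1_" + finals)); // true
console.log(pattern.test("a^b"));                  // false: '^' is not in the class
```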

Kobi 2010-09-16 06:39:05
