views:

1757

answers:

8

I'm trying to find URLs in some text using JavaScript. The problem is that the regular expression I'm using relies on \w to match letters and digits inside the URL, but \w doesn't match non-English characters (in my case, Hebrew letters).

So what can I use instead of \w to match all letters in all languages?

A: 

Perhaps \S (non-whitespace).

chaos
+11  A: 

Because \w only matches ASCII characters 48-57 ('0'-'9'), 65-90 ('A'-'Z'), 97-122 ('a'-'z') and 95 ('_'). Hebrew characters and other non-English characters (for example, umlaut-o or tilde-n) are outside of those ranges.

Instead of matching foreign language characters (there are so many of them, in many different Unicode ranges), you might be better off looking for the characters that delineate your words: spaces, quotation marks, and other punctuation.
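
As a rough sketch of that delimiter-based idea (the function name findUrls, the sample URLs, and the exact set of delimiters are assumptions made for illustration, not something from this thread), the pattern below accepts any character that is not whitespace, a quote, or an angle bracket, so Hebrew letters pass through without being listed:

```javascript
// Hypothetical helper: finds URL-like tokens by looking for delimiters
// instead of enumerating every letter that may appear in a URL.
function findUrls(text) {
  // "http://" or "https://", then one or more characters that are not
  // whitespace, quotes, or angle brackets.
  const urlPattern = /https?:\/\/[^\s"'<>]+/g;
  const matches = text.match(urlPattern) || [];
  // Trailing punctuation usually belongs to the sentence, not the URL.
  return matches.map((url) => url.replace(/[.,;:!?)]+$/, ""));
}

// \u05E9\u05DC\u05D5\u05DD is the Hebrew word "shalom", written with escapes.
const sample =
  'See http://example.com/\u05E9\u05DC\u05D5\u05DD, or "http://example.org/a_b".';
console.log(findUrls(sample));
```

Stripping trailing punctuation is a heuristic; a URL that really ends in one of those characters would be cut short.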

David
Thanks, for the inner parts of the url I ended up matching everything except space, '.' and '/'. Anything else I might be missing?
Doron Yaacoby
Perhaps the colon, ':', which can be used to separate the host name from a port number
David
+1  A: 

Have a look at http://www.regular-expressions.info/refunicode.html.

It looks like there is no \w equivalent for Unicode, but you can match individual Unicode letters, so you can build one yourself.

Gamecat
This page has a more thorough explanation and listing of character patterns: http://www.regular-expressions.info/unicode.html
enobrev
+1  A: 

Check out this SO question about JavaScript and Unicode. It looks like Jan Goyvaerts' answer there provides some hope for you.

Edit: But then it seems that browsers don't support \p ... anyway. That question should still contain useful info.

PEZ
The answer you linked to was wrong. I've updated it.
Jan Goyvaerts
Too bad. \p would have been just what the doctor ordered.
PEZ
A: 

If you're the one generating URLs with non-English letters in them, you may want to reconsider.

If I'm interpreting the W3C correctly, URLs may only contain word characters within the latin alphabet.

Triptych
Sadly I can't control the url-creation, and they almost always will contain Hebrew Characters.
Doron Yaacoby
+4  A: 

The ECMA 262 v3 standard, which defines the programming language commonly known as JavaScript, stipulates that \w should be equivalent to [a-zA-Z0-9_] and that \d should be equivalent to [0-9]. \s on the other hand matches both ASCII and Unicode whitespace, according to the standard.

JavaScript does not support the \p syntax for matching Unicode things either, so there isn't a good way to do this. You could match all Hebrew characters with:

[\u0590-\u05FF]

This simply matches any code point in the Hebrew block.

You can match any ASCII word character or any Hebrew character with:

[\w\u0590-\u05FF]
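
A minimal sketch of how that character class behaves compared with plain \w (the variable names and the sample string are made up for illustration):

```javascript
// ASCII word characters plus any code point in the Hebrew block (U+0590-U+05FF).
const wordOrHebrew = /[\w\u0590-\u05FF]+/g;
// Plain \w for comparison: ASCII letters, digits, and underscore only.
const asciiWord = /\w+/g;

// \u05E9\u05DC\u05D5\u05DD is the Hebrew word "shalom", written with escapes.
const text = "shalom \u05E9\u05DC\u05D5\u05DD world_1";

const withHebrew = text.match(wordOrHebrew); // three runs, including the Hebrew word
const asciiOnly = text.match(asciiWord);     // two runs, the Hebrew word is skipped

console.log(withHebrew, asciiOnly);
```

The same class can be dropped into a URL pattern wherever \w was used before.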

Jan Goyvaerts
A: 

Note that URIs (a superset of URLs) are specified (in RFC 3986, URI: Generic Syntax) to allow only US-ASCII characters. Normally all other characters should be represented in percent-encoded notation:

"In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI." (URI: Generic Syntax)

This is generally what happens when you open a URL with non-ASCII characters in a browser: the characters get translated into %AB-style notation, which is itself US-ASCII.

If it is possible to influence the way the material is created, the best option would be to subject URLs to urlencode() type function during their creation.
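
In JavaScript, the built-in encodeURI() is one such function. The sketch below (the example URL and the ASCII-only pattern are assumptions for illustration) shows that once a URL is percent-encoded, an ASCII-only pattern is enough to match it:

```javascript
// A URL with a Hebrew path segment (written with escapes) and a space in the query.
const rawUrl = "http://example.com/\u05E9\u05DC\u05D5\u05DD?q=a b";

// encodeURI() percent-encodes the UTF-8 bytes of non-ASCII characters and
// the space, but leaves URL delimiters such as ':', '/', '?' and '=' alone.
const encodedUrl = encodeURI(rawUrl);

// Hypothetical ASCII-only URL pattern built on \w.
const asciiUrlPattern = /^https?:\/\/[\w.\/%?=&:-]+$/;

console.log(encodedUrl);
console.log(asciiUrlPattern.test(rawUrl));     // false: Hebrew letters and a space
console.log(asciiUrlPattern.test(encodedUrl)); // true: everything is US-ASCII now
console.log(decodeURI(encodedUrl) === rawUrl); // true: the encoding is reversible
```

For a single path segment or query value, encodeURIComponent() is usually the better fit, since it also escapes delimiters such as '/' and '?'.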

Gnudiff
+2  A: 

I think you are looking for this regex:

^[אבגדהוזחטיכלמנסעפצקרשתץףןםa-zA-Z0-9\s\.\-_\\\/]+$
lani
Welcome to Stack Overflow. I've never tried it, but `א-ת` may work as well, and should even include the final letters - http://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet .
Kobi