Hey,

I want to match a string to make sure it contains only letters.

I've got this and it works just fine:

var onlyLetters = /^[a-zA-Z]$/.test(myString);

BUT

Since I speak another language too, I need to allow all letters, not just A-Z. For example, also:

é ü ö ê å ø

Does anyone know if there is a global 'alpha' term that includes all letters to use with RegExp? Or even better, does anyone have some kind of solution?

Thanks for any help

// I

EDIT: Just realized that you might also wanna allow '-' and ' ' in case of a double name like 'Mary-Ann' or 'Mary Ann'.

A: 

There are some shortcuts to achieve this in other regular expression dialects - see this page. But I don't believe there are any standardised ones in JavaScript - certainly none that would be supported by all browsers.

David M
In particular, the one he seems to want is `\p{L}` aka `\p{Letter}`
MSalters
+3  A: 

There should be, but the regex will be localization dependent. Thus, é ü ö ê å ø won't be filtered if you're on a US localization, for example. To ensure your web site does what you want across all localizations, you should explicitly write out the characters in a form similar to what you are already doing.

The only standard one I am aware of though is \w, which would match all alphanumeric characters. You could do it the "standard" way by running two regex, one to verify \w matches and another to verify that \d (all digits) does not match, which would result in a guaranteed alpha-only string. Again, I'd strongly urge you not to use this technique as there's no guarantee what \w will represent in a given localization, but this does answer your question.
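A rough sketch of the two-regex check described above, in case it helps to see it written out. The function name `isAlphaViaWord` is made up for illustration. One thing to note: in JavaScript, `\w` (without the `u` and `i` flags together) is defined as `[A-Za-z0-9_]`, so the underscore slips through and accented letters are rejected.

```javascript
// Hypothetical helper: "alpha only" approximated as
// "every character matches \w" AND "no character matches \d".
function isAlphaViaWord(text) {
  return /^\w+$/.test(text) && !/\d/.test(text);
}

console.log(isAlphaViaWord("Mary"));      // true
console.log(isAlphaViaWord("abc1"));      // false: contains a digit
console.log(isAlphaViaWord("a_b"));       // true: underscore is part of \w
console.log(isAlphaViaWord("Zo\u00eb"));  // false: ë is outside \w in JavaScript
```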

David Pfeffer
+1  A: 

I don't know anything about Javascript, but if it has proper unicode support, convert your string to a decomposed form, then remove the diacritics from it ([\u0300-\u036f\u1dc0-\u1dff]). Then your letters will only be ASCII ones.
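In JavaScript this would presumably go through `String.prototype.normalize` (available from ES2015 on). A minimal sketch, with made-up function names `stripDiacritics` and `isAsciiLettersAfterStripping`:

```javascript
// Decompose to NFD, then drop combining marks in the two ranges mentioned above.
function stripDiacritics(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f\u1dc0-\u1dff]/g, "");
}

// Hypothetical check: after stripping, only ASCII letters may remain.
function isAsciiLettersAfterStripping(text) {
  return /^[a-zA-Z]+$/.test(stripDiacritics(text));
}

console.log(stripDiacritics("\u00e9\u00fc\u00f6\u00ea\u00e5")); // "euoea" (é ü ö ê å)
console.log(isAsciiLettersAfterStripping("Zo\u00eb"));          // true
console.log(isAsciiLettersAfterStripping("\u00f8"));            // false: ø has no decomposition
```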

Virgil Dupras
This won't work because some of his letters are not just diacritical ASCII. `ø` for example was mentioned, and this isn't the diacritic of `o` as far as I know.
David Pfeffer
Hum, yeah. But if he's going to enumerate all valid characters, doing this diacritic trick is going to save him quite a few enumerations, even if he has to specify `ø` separately.
Virgil Dupras
+3  A: 

You could always use a blacklist instead of a whitelist. That way you only remove the characters you do not need.

Hazior
never heard of it, but it sort of speaks for itself. don't you just check whether it does not contain this, that, etc.?
meow
A blacklist excludes what you do not need. A whitelist only allows what you need. Blacklists are used when you only want to ban certain characters like / or <.
Hazior
so do you declare a blacklist in a special way or is it just a regular regexp saying "does not contain" instead of does?
meow
http://www.hendricom.com/forums/index.php?showtopic=2282 ^ is the blacklist symbol though.
Hazior
That blacklist would need to be pretty long to be sensible.
Debilski
if the blacklist symbol is ^, how come /^[a-zA-Zéüöêåø]*$/.test(myString) returns false when myString contains digits? shouldn't it be the other way around then? uhh nvm :-p
meow
You only need to blacklist the symbols that you don't want them typing. It doesn't have to be long. But whitelisting is the best coding practice in my opinion.
Hazior
@Isabell: Since you said nvm, is it safe to assume you figured it out?
Hazior
yeah thanks anyways
meow
If the character set is UTF-16 the blacklist would need to be about 65k long!
Pool
@The Feast: So say you just want to blacklist "'" It would be 65k long? Maybe to their optimal solution it may be large but you could also do a combination of whitelisting/blacklisting.
Hazior
+6  A: 

I don’t know the actual reason for doing this, but if you want to use it as a pre-check for, say, login names or user nicknames, I’d suggest you enter the characters yourself and don’t use the whole set of ‘alpha’ characters you’ll find in Unicode, because you probably won’t see a visual difference between the following letters:

А ≠ A ≠ Α  # cyrillic, latin, greek

In such cases it’s better to specify the allowed letters manually if you want to minimise account faking and such.
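To make the point concrete, here is a small sketch that prints the code points of the three look-alikes. The array name `lookalikes` and the helper `toCodePointLabel` are just for illustration.

```javascript
// Cyrillic A (U+0410), Latin A (U+0041), Greek Alpha (U+0391), written as escapes.
const lookalikes = ["\u0410", "A", "\u0391"];

// Hypothetical helper: format a character's code point as "U+XXXX".
function toCodePointLabel(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

for (const ch of lookalikes) {
  console.log(ch, toCodePointLabel(ch));
}

console.log(lookalikes[0] === lookalikes[1]);  // false: different characters
console.log(/^[a-zA-Z]+$/.test("\u0410"));     // false: the Cyrillic letter is rejected
```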

Addition

Well, if it’s for a field which is supposed to be non-unique, I would allow Greek as well. I wouldn’t feel good about forcing users to change their name to a latinised version.

But for unique fields like nicknames you need to give the other visitors of the site some assurance that it’s really the nickname they think it is. It’s bad enough that people already fake accounts by interchanging I and l. Of course, it depends on your users; but to be sure, I think it’s better to allow basic Latin + diacritics only. (Maybe have a look at this list: Latin-derived_alphabet)

As an untested suggestion (with ‘-’, ‘_’ and ‘ ’):

/^[a-zA-Z\-_ ’'‘ÆÐƎƏƐƔIJŊŒẞÞǷȜæðǝəɛɣijŋœĸſßþƿȝĄƁÇĐƊĘĦĮƘŁØƠŞȘŢȚŦŲƯY̨Ƴąɓçđɗęħįƙłøơşșţțŧųưy̨ƴÁÀÂÄǍĂĀÃÅǺĄÆǼǢƁĆĊĈČÇĎḌĐƊÐÉÈĖÊËĚĔĒĘẸƎƏƐĠĜǦĞĢƔáàâäǎăāãåǻąæǽǣɓćċĉčçďḍđɗðéèėêëěĕēęẹǝəɛġĝǧğģɣĤḤĦIÍÌİÎÏǏĬĪĨĮỊIJĴĶƘĹĻŁĽĿʼNŃN̈ŇÑŅŊÓÒÔÖǑŎŌÕŐỌØǾƠŒĥḥħıíìiîïǐĭīĩįịijĵķƙĸĺļłľŀʼnńn̈ňñņŋóòôöǒŏōõőọøǿơœŔŘŖŚŜŠŞȘṢẞŤŢṬŦÞÚÙÛÜǓŬŪŨŰŮŲỤƯẂẀŴẄǷÝỲŶŸȲỸƳŹŻŽẒŕřŗſśŝšşșṣßťţṭŧþúùûüǔŭūũűůųụưẃẁŵẅƿýỳŷÿȳỹƴźżžẓ]$/.test(myString)

Another edit: I have added the apostrophe for people with names like O’Neill or O’Reilly. (And the straight and the reversed apostrophe for people who can’t enter the curly one correctly.)

Debilski
good point. it's for a form and the Name input. come to think about it, I have seen loads of "choose a username (A-Z 0-9 - .)" then if ur greek, I guess ur just unlucky :-p
meow
wow look at that! looks like u managed to catch all weird characters ever made :-p and it works great! awesome job! thanks for that!
meow
I'm positive that regex can be improved somewhat by using character ranges. Something like `[A-Za-zÀ-ÿ]` would catch the ASCII letters plus the accented Latin-1 letters. Check http://en.wikipedia.org/wiki/List_of_Unicode_characters for a full list.
DisgruntledGoat
But between ‘À’ and ‘ÿ’ there is ‘×’ and ‘÷’ which you might want to exclude. Nonetheless, if ranges work also for unicode characters, one could just include the ranges of Latin Extended-A and Extended-B and the Basic Latin stuff.
Debilski
@Debilski, you're totally right, ‘×’ and ‘÷’ are not accepted. This is the one I chose: /^[a-zA-Z\- ÅåÄäÖöØøÆæÉéÈèÜüÊêÛûÎî]*$/
meow
+3  A: 

This can be tricky; unfortunately, JavaScript has pretty poor support for internationalization. To do this check you'll have to create your own character class. This is because, for instance, \w is the same as [0-9A-Z_a-z], which won't help you much, and there isn't anything like [[:alpha:]] in JavaScript. But since it sounds like you're only going to use one other language, you can probably just add those other characters to your character class.

By the way, I think you'll need a + or * in your regexp there if myString can be longer than one character.

The full example,

/^[a-zA-Zéüöêåø]*$/.test(myString);
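A quick sketch of what the quantifier changes. The variable names are only for illustration, and the letters are written as escapes (é ü ö ê å ø) to avoid encoding surprises. Note that `*` also accepts the empty string, so `+` may be the better fit if the field is required.

```javascript
const singleChar = /^[a-zA-Z]$/;                                        // exactly one ASCII letter
const anyLength = /^[a-zA-Z\u00e9\u00fc\u00f6\u00ea\u00e5\u00f8]*$/;    // zero or more allowed letters
const nonEmpty = /^[a-zA-Z\u00e9\u00fc\u00f6\u00ea\u00e5\u00f8]+$/;     // one or more allowed letters

console.log(singleChar.test("a"));       // true
console.log(singleChar.test("ab"));      // false: more than one character
console.log(anyLength.test("ab"));       // true
console.log(anyLength.test("S\u00f8"));  // true: ø is in the class
console.log(anyLength.test("abc1"));     // false: digit not allowed
console.log(anyLength.test(""));         // true: * allows empty input
console.log(nonEmpty.test(""));          // false
```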

Mike Nelson
thanks for that! missed the * in the end
meow
you're welcome :)
Mike Nelson
+4  A: 

You can't do this in JS. It has very limited regex and normalizer support. You would need to construct a lengthy and unmaintainable character array with all possible Latin characters with diacritical marks (I guess there are around 500 different ones). Rather, delegate the validation task to the server side, which can use another language with more regex capabilities, if necessary with the help of ajax.

In a full-fledged regex environment you could just test if the string matches \p{L}+. Here's a Java example:

boolean valid = string.matches("\\p{L}+");

Alternatively, you could also normalize the text to get rid of the diacritical marks and check if it contains [A-Za-z]+ only. Here's a Java example again:

string = Normalizer.normalize(string, Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
boolean valid = string.matches("[A-Za-z]+");

PHP supports similar functions.
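For what it's worth, newer JavaScript engines have caught up on this point: ES2018 added Unicode property escapes, so `\p{L}` works in a regex that carries the `u` flag. A minimal sketch, assuming such an engine; the names `lettersOnly` and `nameLike` are invented here, and the second pattern is just one possible way to allow the '-' and ' ' from the question's edit.

```javascript
// Requires an engine with ES2018 Unicode property escapes (note the "u" flag).
const lettersOnly = /^\p{L}+$/u;

// Hypothetical variant: letter runs separated by a single space or hyphen.
const nameLike = /^\p{L}+(?:[ -]\p{L}+)*$/u;

console.log(lettersOnly.test("Zo\u00eb"));   // true: ë is a letter
console.log(lettersOnly.test("\u00f8"));     // true: ø is a letter
console.log(lettersOnly.test("abc1"));       // false: digit
console.log(lettersOnly.test("\u00d7"));     // false: × is a math symbol, not a letter
console.log(nameLike.test("Mary-Ann"));      // true
console.log(nameLike.test("Mary Ann"));      // true
console.log(nameLike.test("-Mary"));         // false: must start with a letter
```

Keep in mind this accepts letters from every script, so the look-alike concern raised in another answer still applies.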

BalusC
+1  A: 

You could use a blacklist - a list of characters to exclude.
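A blacklist is just an ordinary regex that looks for the unwanted characters, with the result negated. A small sketch; the forbidden set (`<`, `>`, `/`) and the function name `passesBlacklist` are arbitrary examples, not a recommendation.

```javascript
// Characters we do not want to see anywhere in the input (example set only).
const forbidden = /[<>\/]/;

// Hypothetical helper: the input passes if no forbidden character occurs.
function passesBlacklist(text) {
  return !forbidden.test(text);
}

console.log(passesBlacklist("Mary-Ann"));  // true
console.log(passesBlacklist("<script>"));  // false
console.log(passesBlacklist("R2-D2"));     // true: digits were never listed
```

The last line shows the trade-off: anything not explicitly listed gets through.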

Also, it is important to verify the input on server-side, not only on client-side! Client-side can be bypassed easily.

frunsi