Hi,

I have come across some articles and posts that advise against using regular expressions to validate user data. I am not sure about all the cases, but I usually see this advice in the context of email address verification.

So I want to be clear: is using regular expressions to validate user input good or not? If it is good, then what is wrong with using them to validate email addresses?

Edit:

So can we say that regex is fine for basic, first-pass validation of data types, and that for full validation we need to combine it with another parser?

And for the second part: in general usage we can use regex for email validation, but it is not adequate for validating strictly against the standard. Is that right?

Now I am confused about which answer to accept.

+1  A: 

For e-mail addresses it is good to use regular expressions. It will work in most cases.

In general: you should validate with regular expressions whatever can be expressed as a regular language.

Victor Hurdugaci
That depends on the regular expression that you use. I've seen too many false negatives to encourage people to use regular expressions for email address checking (at least without a *lot* of provisos).
David Dorward
+3  A: 

Your question seems to have two parts: (1) is using regular expressions for data validation bad, and (2) is using them for validating email addresses bad?

Re (1), this really depends upon the situation. In many situations a regular expression will be more than adequate to validate user input; for example, validating that a username has only alphanumeric characters. Where a set of regular expressions will probably be inadequate is when the input might be passed to something like a database query or an eval() statement. In these instances there may be language constructs like recursion that cannot be handled with regular expressions, and, more generally, you will want something that knows a lot about the target language to do the validation (and sanitization).
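As a rough sketch of the username case mentioned above (the function name `is_valid_username` and the ASCII-only character set are my assumptions, not part of the question), a single character class with a full-string match is enough:

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
USERNAME_RE = re.compile(r"[A-Za-z0-9]+")

def is_valid_username(name: str) -> bool:
    """Return True if the whole string consists of ASCII letters and digits."""
    # fullmatch requires the pattern to cover the entire input,
    # so no explicit ^ and $ anchors are needed.
    return USERNAME_RE.fullmatch(name) is not None
```

Using `fullmatch` rather than `search` matters here: `search` would accept any input that merely contains an alphanumeric run somewhere.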

In most cases you'll want to escape the input so that it will be an innocuous string in the target language.
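A minimal sketch of what that can look like in Python, assuming HTML and SQLite as the two target languages (the `users` table and both function names are made up for illustration). Note that the SQL half uses parameter binding, which hands the value to the driver separately instead of escaping it into the query string:

```python
import html
import sqlite3

def render_greeting(name: str) -> str:
    """Escape user input before embedding it in HTML."""
    return "<p>Hello, " + html.escape(name) + "</p>"

def find_user_ids(conn: sqlite3.Connection, name: str) -> list:
    """Look up users by name with a bound parameter (hypothetical `users` table)."""
    # The "?" placeholder keeps the input out of the SQL text entirely.
    cursor = conn.execute("SELECT id FROM users WHERE name = ?", (name,))
    return [row[0] for row in cursor.fetchall()]
```

In both cases a library that knows the target language does the work, so no hand-written regex has to anticipate every dangerous construct.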

If you are validating the correctness of code, you will want a full-blown parser for this. A parser may make use of regular expressions, but typically parsers use other things to do the heavy lifting.

Eric W.
I agree with breaking my question into two parts. So then it is good to use regex for the first step of data validation and later use another parser for full validation. So what about the second part, email?
KoolKabin
I don't know enough about the case of validating email addresses to be able to say one way or another. The sense I get is that it's complex enough that you might want to use an existing library for it, but I've also seen some regexes that try to validate email addresses, and these may be adequate. If you need a high level of correctness, I would become familiar with the relevant RFCs. This might be made more difficult because not all email providers may require strictly valid email addresses. Just some thoughts--someone else will know more here.
Eric W.
+3  A: 

It’s good because you can use regular expressions to express and test complex patterns in an easy way.

It’s bad because regular expressions can be complicated and there is much you can do wrong.


Edit: Well, ok. Here’s some real advice: First make sure that the expected valid values can be expressed using a regular expression at all. That is the case when the language of valid values is a regular language. Otherwise you simply cannot use regular expressions (or at least not regular expressions only)!

Now that we know what can be validated using regular expressions, we should discuss what is practical to validate using regular expressions. If we take an e-mail address as an example (like many others did), we should know what a valid e-mail address may look like (see RFC 5322):

addr-spec       =   local-part "@" domain
local-part      =   dot-atom / quoted-string / obs-local-part
domain          =   dot-atom / domain-literal / obs-domain
domain-literal  =   [CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS]
dtext           =   %d33-90 /          ; Printable US-ASCII
                    %d94-126 /         ;  characters not including
                    obs-dtext          ;  "[", "]", or "\"

Here we see that the local-part may consist of a quoted-string that may contain any printable US-ASCII character (excluding \ and ", but including @). So it is not sufficient to test whether the e-mail address contains just one @ if we want to allow addresses according to RFC 5322.

On the other hand, if we want to allow any valid e-mail address according to RFC 5322, we would also allow addresses that probably do not exist or are just senseless in most cases (e.g. ""@localhost).
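To make the point about @ concrete, here is a small sketch of a naive "exactly one @" check (the name `naive_is_email` and the sample addresses are illustrative assumptions). It rejects an address whose quoted local-part contains an @, even though the grammar quoted above permits it:

```python
import re

# Naive rule: some non-@ characters, one @, some more non-@ characters.
NAIVE_RE = re.compile(r"[^@]+@[^@]+")

def naive_is_email(address: str) -> bool:
    """Accept only strings with exactly one @ and text on both sides."""
    return NAIVE_RE.fullmatch(address) is not None

# Ordinary address: accepted.
ordinary = naive_is_email("john@example.com")
# Quoted local-part containing "@": rejected by the naive rule.
quoted = naive_is_email('"john@work"@example.com')
```

Whether rejecting such addresses is acceptable depends on the application; the sketch only shows that the shortcut is stricter than the RFC.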

Gumbo
So well then it is good to use regex for 1st step of data validation and later use another parser for full validation
KoolKabin
@KoolKabin: Using regular expressions would only fulfill the syntactic test but not a semantic test. *Syntactically valid* means it complies with the standards. *Semantically valid* means it makes sense in the field/area you want to use it in.
Gumbo
This made things clearer for me, thanks.
KoolKabin
A: 

The concerns are probably about the fact that the regular expressions in use often do not cover all the possible (valid) inputs and/or restrict the user too much in what he can input.

I see no other way to validate if some user input matches a certain schema (I mean, that is what regular expressions are for), so they are essential (imo) for user input validation. But you definitely have to put some time into designing an expression, to make sure it really works, also in extreme cases.

Take credit card numbers. You have to consider the ways a user might enter them:

1234-5678
// or
1234 5678
// or
1234 - 5678

And now you have two possibilities:

  1. You restrict the input to the first case which will result in an easier expression but will restrict (and maybe annoy) the user the most.
  2. You create an expression that accepts any of these possibilities, making the expression more complicated (hence harder to maintain) but more user-friendly.

It is a trade-off.
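The two possibilities could be sketched like this in Python, using the shortened sample numbers from above rather than real card numbers (the pattern names are my own):

```python
import re

# Option 1: only the "1234-5678" form is accepted.
STRICT_RE = re.compile(r"\d{4}-\d{4}")

# Option 2: optional whitespace and an optional dash between the groups.
TOLERANT_RE = re.compile(r"\d{4}\s*-?\s*\d{4}")

def accepted_by(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole input matches the given pattern."""
    return pattern.fullmatch(text) is not None

SAMPLES = ["1234-5678", "1234 5678", "1234 - 5678"]
strict_results = [accepted_by(STRICT_RE, s) for s in SAMPLES]
tolerant_results = [accepted_by(TOLERANT_RE, s) for s in SAMPLES]
```

Even in this toy case the tolerant pattern is noticeably harder to read, and it also lets through forms nobody listed, such as `12345678`.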

Felix Kling
3. You accept any input but remove any non-digit character, validate that value and reformat it if desired.
Gumbo
"The concerns are probably about the fact that the regular expressions in use often do not cover all the possible (valid) inputs and/or restrict the user too much in what he can input." This can be said about any validation method.
Justin Johnson
@Gumbo: Good point :) That might work here, but I thought about it more as a general example.
Felix Kling
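Gumbo's third option from the comments above (accept anything, strip non-digits, validate, reformat) might look roughly like this. The 8-digit length just follows the toy numbers in the answer, and `normalize_number` is a hypothetical name:

```python
import re
from typing import Optional

EXPECTED_DIGITS = 8  # assumption: matches the shortened examples, not a real card length

def normalize_number(raw: str) -> Optional[str]:
    """Strip every non-digit, validate the length, and return a canonical form."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != EXPECTED_DIGITS:
        return None  # wrong number of digits: reject
    # Reformat into the single canonical "1234-5678" style.
    return digits[:4] + "-" + digits[4:]
```

Here the regex only does the trivial part (removing noise), and the actual validation is ordinary code on the cleaned value.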
+1  A: 

If the pattern of the data you are validating can be expressed completely and correctly using regular expressions, you can use them safely with no worries. However, not all textual patterns can be expressed using regular expressions (e.g. languages that require a context-free grammar, such as arbitrarily nested brackets). In such cases you might need to write a parser or a custom method for validating the data.
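Balanced parentheses are the usual example of a pattern that is not a regular language, because checking it requires counting to an unbounded depth. A custom method for it can be very small; this is a sketch with a made-up function name:

```python
def is_balanced(text: str) -> bool:
    """Return True if every "(" has a matching ")" in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis appeared before its opener.
                return False
    # Anything still open at the end is unbalanced.
    return depth == 0
```

The counter is exactly the piece of state that a classical regular expression has no way to keep.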

Bytecode Ninja
+2  A: 

Regular expressions can be bad for three reasons:

  1. They can get really complicated, and eventually unmaintainable. It's very easy to make mistakes.
  2. There are certain types of text that cannot be parsed with regular expressions at all (e.g. HTML). Basically, anything with nested patterns cannot be parsed with regular expressions. You wouldn't be able to parse a programming language with regex, for example.
  3. Depending on what kind of text you are working with, it may be easier and clearer if you just write your own code to parse it.

But if none of these is an issue for whatever you are working with, then there is nothing wrong with using regular expressions. I would say validating email addresses is a good use of regex.

musicfreak
+1  A: 

Regular expressions are a tool like any other, albeit a very powerful one.

They are so powerful that people using them tend to suffer from the problem of everything looking like a nail (when you have a hammer). This leads to them being used in situations where another method would be more verbose but more efficient and more maintainable.

In the specific case of email addresses, the main problem here is that there are a very large number of regular expressions out there which claim to validate email address syntax, but are loaded up with problems that cause false negatives.

The main problems with them include:

  • Disallowing plus characters in the first half of the address (despite them being relatively common)
  • Limiting the TLD to three characters (thus blocking out the .museum TLD)
  • Limiting the TLD to two-character country code TLDs or a list of specific TLDs (thus forcing it to be updated whenever a new TLD comes into play; guess what never happens?)

Email addresses are so complex that a regular expression shouldn't really try to check for anything more than:

  1. Something that doesn't include an @
  2. An @
  3. Something that doesn't include an @
  4. A .
  5. Something that doesn't include an @
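Translated literally into a Python regex, the five steps above could look like this (a sketch; the names are mine):

```python
import re

# 1. non-@ text   2. "@"   3. non-@ text   4. "."   5. non-@ text
LOOSE_EMAIL_RE = re.compile(r"[^@]+@[^@]+\.[^@]+")

def looks_like_email(address: str) -> bool:
    """Loose sanity check: the shape something@something.something."""
    return LOOSE_EMAIL_RE.fullmatch(address) is not None
```

This accepts plus signs in the first half and long TLDs such as .museum, so it avoids the false negatives listed above, at the cost of accepting plenty of strings that are not deliverable addresses.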
David Dorward
The 'local-part' of the email address actually *can* include an @ symbol, it's just not very common.
MikeD
A: 

Regexes aren't bad for validating most data, if it's a Regular Language.

But, as has been noted, sometimes they can become difficult to maintain, and the programmers introduce errors.

The simplest way to mitigate the situation is with tests/TDD. These tests should call a method that uses the regular expression to validate email addresses (I currently use the regex /^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$/i, which is working well enough). This way, when you get a false positive or false negative, you can add another test for that case, adjust your regular expression, and ensure you didn't break some other condition.
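A rough Python version of that workflow, using the regex above with a table of cases (the case list and the names `is_valid_email` and `failing_cases` are assumptions for illustration):

```python
import re

# The regex from this answer; re.IGNORECASE plays the role of the /i flag.
EMAIL_RE = re.compile(r"^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$", re.IGNORECASE)

def is_valid_email(address: str) -> bool:
    """The single method that every test goes through."""
    return EMAIL_RE.fullmatch(address) is not None

# (address, expected result); extend this table whenever a bug is reported.
CASES = [
    ("user@example.com", True),
    ("user+tag@mail.example.co.uk", True),
    ("user@example", False),
    ("user@@example.com", False),
]

def failing_cases() -> list:
    """Return the cases where the regex disagrees with the expectation."""
    return [(addr, want) for addr, want in CASES if is_valid_email(addr) != want]
```

For example, an address like curator@example.museum is rejected by the {2,4} part. If that counts as a false negative for you, add it to CASES with True, watch failing_cases() report it, then adjust the regex until the list is empty again.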

If TDD seems a bit much, a tool like Expresso lets you save regexes with test data, and that can aid in keeping track of values that should pass/fail and aid in creating and understanding your regex.

WARNING:

Take some care in constructing regular expressions. There is potential for introducing ReDoS (regular expression denial of service) vulnerabilities.

See: http://msdn.microsoft.com/en-us/magazine/ff646973.aspx

In short, a poorly constructed regex, given the right input, can take hours to execute, effectively killing your server's performance.
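A small sketch of the effect in Python, using the textbook nested-quantifier pattern rather than anything from a real application (the names are mine):

```python
import re
import time

# Nested quantifiers: a run of "a" can be split between the inner and
# outer "+" in many ways, and a backtracking engine may try them all.
RISKY_RE = re.compile(r"^(a+)+$")

# Accepts the same strings with a single quantifier.
SAFER_RE = re.compile(r"^a+$")

def seconds_to_reject(pattern: re.Pattern, n: int) -> float:
    """Time how long `pattern` takes to reject n "a" characters followed by "b"."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = pattern.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing "b" can never match
    return elapsed
```

Calling seconds_to_reject(RISKY_RE, n) with n in the low twenties should show the time growing steeply with each extra character on CPython's backtracking engine, while SAFER_RE stays essentially flat. Keep n small when trying this.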

Chad