Hi,

I have a regex that I am using to validate email addresses. I like this regex because it is fairly relaxed and has proven to work quite well.

Here is the regex:

(['\"]{1,}.+['\"]{1,}\s+)?<?[\w\.\-]+@[^\.][\w\.\-]+\.[A-Za-z]{2,}>?

Ok great, basically all reasonably valid email addresses that you can throw at it will validate. I know that maybe even some invalid ones will fall through but that is ok for my specific use-case.

Now it happens that [email protected] does not validate. And guess what: x.com is actually a domain name that exists (owned by PayPal).

Looking at the regex part that validates the domain name:

@[^\.][\w\.\-]+

It looks like this should be able to parse the x.com domain name, but it doesn't. The culprit is the part that checks that a domain name cannot begin with a dot (such as [email protected]):

@[^\.]

If I remove the [^\.] part of my regex, the domain x.com validates, but now the regex allows domain names beginning with a dot, such as .test.com; this is a little bit too relaxed for me ;-)

So my question is: how can the negated character class affect my single-character check? The way I am reading the regex is "make sure this string does not start with a dot", but apparently it does more.

Any help would be appreciated.

Regards,

Waseem

+2  A: 

If you change [^\.][\w\.\-]+ to [^\.][\w\.\-]*, it will work as you expect!

The reason is that [^\.] matches a single character that is not a dot (in your case, the "x" in "x.com"). The pattern then requires one or more characters from [\w\.\-], followed by a literal dot. The + has to consume the dot after the x, and then there is no dot left for \. to match. The * matches zero or more characters after the first one, which is what you want.
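To make the difference concrete, here is a minimal Python sketch that runs the domain part of the pattern with + and with * against a few sample addresses. The addresses are hypothetical ones picked for illustration, and Python's re module is only an assumption about the regex engine; the question does not say which one is in use.

```python
import re

# Domain part of the original pattern, and the same part with + changed to *.
PLUS = re.compile(r"@[^\.][\w\.\-]+\.[A-Za-z]{2,}")
STAR = re.compile(r"@[^\.][\w\.\-]*\.[A-Za-z]{2,}")

# Hypothetical sample addresses, chosen only for illustration.
samples = ["user@x.com", "user@xy.com", "user@.test.com"]

# Map each address to (matches with +, matches with *).
results = {s: (bool(PLUS.search(s)), bool(STAR.search(s))) for s in samples}

for address, (plus_ok, star_ok) in results.items():
    print(f"{address!r}: plus={plus_ok} star={star_ok}")
```

With these samples, only the single-character domain behaves differently between the two versions; the leading-dot domain is rejected by both.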

Luís Guilherme
+2  A: 

Change the quantifier +, meaning one or more, to *, meaning zero or more.

Michael Petito
A: 

Change @[^\.][\w\.\-]+ to @[^\.][\w\.\-]*. The reason you need this is that [^\.] matches a single character that is not a dot, here the x. No more characters are left before the dot, so [\w\.\-]+ has nothing to match except the dot itself, even though the plus sign requires a minimum of one character. Changing the plus to a star fixes this.

klausbyskov
A: 

Look at the broader context in your pattern:

@[^\.][\w\.\-]+\.[A-Za-z]{2,}

So for [email protected],

  • [^.] matches x
  • [\w.-]+ matches .
  • \. needs a dot but finds c

Change this part to @[^.][\w-]*\.[A-Za-z]{2,}
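The walkthrough above can be checked with a short Python sketch. The address is a hypothetical stand-in, and the use of Python's re module is an assumption. It shows that the fragment from the question does match on its own, and that the failure only appears once the trailing \.[A-Za-z]{2,} is appended.

```python
import re

address = "user@x.com"  # hypothetical address, for illustration only

# On its own, the fragment matches: + greedily consumes ".com".
fragment = re.search(r"@[^.][\w.-]+", address)

# With the TLD part appended there is no way to split ".com" so that
# the class takes at least one character and \. still lands on a dot.
original = re.search(r"@[^.][\w.-]+\.[A-Za-z]{2,}", address)

# The replacement suggested above: zero or more non-dot label characters.
suggested = re.search(r"@[^.][\w-]*\.[A-Za-z]{2,}", address)

print(fragment.group())   # the fragment alone succeeds
print(original)           # None: the full domain part fails
print(suggested.group())  # the suggested version succeeds
```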

Greg Bacon
+3  A: 

As Luís suggested, you can use [^\.][\w\.\-]* to match the domain name; however, it will now also match addresses like [email protected] and john@@.com. You might want to make sure that there is only one period at a time, and that the first character after the @ is more restricted than just not being a period.

Match the domain name and the period (and subdomains and their periods) using:

([\w\-]+\.)+

So your pattern would be:

(['\"]{1,}.+['\"]{1,}\s+)?<?[\w\.\-]+@([\w\-]+\.)+[A-Za-z]{2,}>?
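As a rough check, the following Python sketch compares the * variant with the pattern above, using whole-string matching. Apart from john@@.com, the sample addresses are hypothetical, and Python's re module is an assumption about the engine in use.

```python
import re

# The * variant of the original pattern, and the pattern proposed above.
STAR = re.compile(r"(['\"]{1,}.+['\"]{1,}\s+)?<?[\w\.\-]+@[^\.][\w\.\-]*\.[A-Za-z]{2,}>?")
LABELS = re.compile(r"(['\"]{1,}.+['\"]{1,}\s+)?<?[\w\.\-]+@([\w\-]+\.)+[A-Za-z]{2,}>?")

# Sample addresses; all but john@@.com are made up for illustration.
samples = [
    "user@x.com",             # single-character domain
    "user@mail.example.com",  # subdomain
    "john@@.com",             # doubled @
    "user@a..com",            # two periods in a row
    "user@.test.com",         # domain starting with a period
]

# Map each address to (accepted by * variant, accepted by label-based pattern).
results = {
    s: (bool(STAR.fullmatch(s)), bool(LABELS.fullmatch(s))) for s in samples
}

for address, (star_ok, labels_ok) in results.items():
    print(f"{address!r}: star={star_ok} labels={labels_ok}")
```

Both patterns accept the first two addresses and reject the last one; only the label-based pattern also rejects the doubled @ and the consecutive periods.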
Guffa