views:

303

answers:

6

Hi, I'd like to know if it's a good idea to avoid regex.

So far I have avoided it in every case, and some people have been advising me that I shouldn't, since if you know what everything means, like:

[] '|' \A \B \d \D \W \w \S \Z $ * ? ...

it would be easy to read, right? But I feel that by avoiding regex I end up with more readable code.

It gets more unreadable as it gets bigger. For example, from validators.py:

email_re = re.compile(
    r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"  # dot-atom
    r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"' #     quoted-string
    r')@(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?$', re.IGNORECASE)  # domain

So, I'd like to know: is there a reason not to avoid regex?

+15  A: 

No, don't avoid regular expressions. They're actually quite a nifty little tool and will save you a lot of work if you use them wisely.

What you do need to avoid is trying to use it for everything, a malaise that appears to strike those new to regular expressions before they become a little more tempered and a little less enamoured :-)

For example, don't use it to validate email addresses. The way you validate an email address is to send an email to it with a link that the receiver has to click on to complete the "transaction".

There are billions of valid email addresses (according to the RFCs) that have no physical email receiver behind them. The only way to be certain that there is a receiver is to send an email and wait for proof positive that it was received and acted upon.

If I find myself writing a regular expression that's more than, let's say, 60 characters, I step back to see if there's a more readable way. Similarly, if I write a regular expression and come back a week later and can't instantly recognise what it does, I think about replacing it. This particular paragraph consists of my opinions of course, but they've served me well :-)
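As a rough sketch of the approach described above (not a production implementation): do only a trivially readable sanity check up front, then rely on a confirmation link. The function names, the `example.com` URL and the in-memory `pending` dict are all hypothetical, and actually sending the message (for instance with `smtplib`) is left out.

```python
import secrets

# Hypothetical in-memory store mapping confirmation tokens to addresses;
# a real application would persist these somewhere.
pending = {}

def plausible_address(address):
    """Cheap sanity check: the kind of test that needs no regex."""
    local, sep, domain = address.rpartition("@")
    return bool(sep and local and domain)

def start_confirmation(address):
    """Return a confirmation link to mail out, or None if obviously wrong."""
    if not plausible_address(address):
        return None
    token = secrets.token_urlsafe(16)
    pending[token] = address
    # example.com is a placeholder host, not a real endpoint.
    return "https://example.com/confirm?token=" + token

def confirm(token):
    """Called when the receiver clicks the link; returns the proven address."""
    return pending.pop(token, None)
```

Only an address whose token comes back through `confirm` is known to have a receiver behind it; the up-front check merely catches things like a missing `@`.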

paxdiablo
I agree having an email sent to confirm the existence of the address is great, but it's nice to check if the email address entered is invalid. The user might forget the `@` and you can check if it's there and give an error. It's better to do that than accept it and fail at emailing the message. The user wouldn't know why he's not getting his email.
vlad003
@vlad003 - so then you just use `if "@" in email_address...` - in which case, a regex is overkill. Anything more complicated than that, and you're asking for trouble...
detly
@vlad, there's a big difference between checking for a "@" and the monstrosity you have to use for a fully validated email address. By all means do a simple check like that, it's at least readable :-)
paxdiablo
The `@` was just an example. There may be many errors a person could make when typing in their email. If it's invalid and the app accepts it, then the person will hit submit and expect their email (which they'll never get). Resending won't work; and changing the address won't be possible either... And I'm sure servers creating email addresses will need to know if the one the user wants is valid or not.
vlad003
But I can see why it's nice to avoid validating email addresses yourself. I guess in that case you can use what someone else's written? http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html
vlad003
We've implemented these things before, and we make it clear that it's the user's responsibility to make sure they get it right (that's why we get them to enter it twice and check that they're identical). They're also told that, if they don't get a confirm email in two days, they have to fix the problem and try again. It's not just that the email address may be invalid, you have to ensure it's not blocked by spam filters and such - even a perfectly valid email address can be unusable due to things like that, and the only way to know for sure is the confirm link: a regex won't help there :-)
paxdiablo
@vlad003 - if "it's nice to avoid validating email addresses yourself" equates to having an 82 line regex consisting largely of various types of brackets lurking in your code... then we may have different definitions of "nice" :P By the way, even that doesn't enforce the 255 character limit mandated by the standard (as far as I can tell, anyway).
detly
The point is that you shouldn't avoid helping the user because you are afraid of writing regular expressions. Forcing the user to confirm by sending them an email is great, but it doesn't solve the same problem as checking that the email is valid. Forcing the user to type the email twice is even worse - he will probably just end up copy-and-pasting it, and you still haven't helped catch trivial mistakes.
Avi
As I've said before, you can have a valid email address which is entirely invalid because it's just plain _wrong._ A regular expression won't detect that I've misspelt my name as `paxdaiblo`. Confirmation links will. I'm not arguing with simple checks here - ensuring that the basic rules are followed is a good use of regular expressions. Using a behemoth to try and catch every possible problem doesn't actually _catch_ every possible problem and engenders a false sense of security.
paxdiablo
And, in actual fact, a confirmation link _does_ solve exactly the same problem as checking the email format (and more). It's slower, I'll grant you that but it's far more dependable. But I can see that we're probably going to have to agree to disagree, so I won't push the point any more. Disagreement is the hallmark of intelligent conversation - I would find it hard envisaging a place more boring than one where everyone agreed with me :-)
paxdiablo
I agree with not going crazy on the validation, but keep this [cautionary tale](http://xkcd.com/327/) in mind.
Hugh Brackett
+2  A: 

If you choose to use a more general parsing approach, like pyparsing or PLY, you will never require regular expressions (which can only match a small subset of the languages matchable with such general parsers). However, lexers such as the one in PLY are typically built around regular expressions (which are a perfect match for a lexer's needs!), so you will probably have to avoid that (as well as powerful tools such as BeautifulSoup when any "normal" user would be able to keep using and enjoying it by simply passing a regular expression object as the selector, since BeautifulSoup fully supports that) and will have to recode a lot of such existing parsers with your chosen general-purpose parsing package.

Performance may suffer greatly, of course, by using extremely general tools in cases where simpler, highly optimized and concise ones would be a perfect solution -- and the size of your code may "blow up" to being very large in many common cases. But if you don't mind having programs twice as big and twice as slow, and are determined to avoid regular expressions at all costs, you can do that.

On the other hand, if your main concern is with readability (quite an understandable and commendable concern, too), then the re.VERBOSE option, by allowing abundant use of whitespace and comments within the RE's pattern, can really do wonders for that goal without removing any of REs' advantages (except by diluting a sometimes-excessive conciseness;-). You WILL want to also keep at least one general-purpose parsing system under your belt, of course (rather than stretch REs to do tasks they're wrong for, as so many people unfortunately do!) -- but a minimal command of REs will serve you well in so many cases (including, for example, full use of BeautifulSoup and many other tools which can accept REs as parameters to apply them appropriately) that I think it's quite to be recommended.
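To illustrate the re.VERBOSE point, here is a small sketch using a deliberately simplified address pattern (chosen only for illustration, not taken from any RFC or library): the same regular expression written compactly and then with whitespace and comments.

```python
import re

# Compact form: correct, but the reader has to decode it in their head.
compact = re.compile(r"^[\w.+-]+@(?:[\w-]+\.)+[A-Za-z]{2,}$")

# Same pattern with re.VERBOSE: whitespace is ignored and '#' starts a comment.
verbose = re.compile(r"""
    ^
    [\w.+-]+          # local part: word characters, dot, plus, hyphen
    @                 # the separator
    (?: [\w-]+ \. )+  # one or more dot-terminated domain labels
    [A-Za-z]{2,}      # top-level domain, at least two letters
    $
    """, re.VERBOSE)

for candidate in ("someone@example.org", "no-at-sign.example.org"):
    # Both forms should agree on every input.
    print(candidate, bool(compact.match(candidate)), bool(verbose.match(candidate)))
```

Note that inside a character class whitespace and `#` stay literal even in verbose mode, so the classes themselves are written exactly as in the compact form.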

Alex Martelli
+6  A: 

Regular expressions are a tool. They are perfectly suited to some tasks and not to others. Like any tool, use them when they are the right tool for the job. Don't just avoid them because somebody said they were bad. Learn how to use them and then you can decide for yourself rather than depending on someone else's dogma.

Bryan Oakley
A: 

Regular expressions are likely the right tool for extracting/validating email addresses...

To extract one or more email addresses from raw text:

import re

pat_e = re.compile(r'(?P<email>[\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,})')

def extract_emails(text):
    # Collect every substring of text that looks like an email address.
    return [m.group('email') for m in pat_e.finditer(text)]

To see if a single piece of text is a valid email:

import re

pat_m = re.compile(r'([\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,}$)')

def looks_like_email(text):
    # match() anchors at the start of text; the trailing $ anchors at the end.
    return bool(pat_m.match(text))
damzam
This fails on anything with a plus sign (`+`) before the `@`, which is [perfectly valid](http://en.wikipedia.org/wiki/Email_address#Specification) for an email address.
detly
What happens when they decide to create a 5-letter TLD?
Gabe
Ever heard of .museum and .travel TLDs?
Schnouki
Thanks for the comments. I've updated the patterns to accommodate longer TLDs and the character '+' when it appears before the @.
damzam
This still does not fit the standards. "a+!=" is a valid local part of the address. As is ".{^_^}."
detly
+1  A: 

Just for comparison, here is my version of an email format check that does not use regexp (with test cases), plus one readable regexp that was offered to me as an alternative (though sending an email after the address is accepted is a great idea):

# -*- coding: utf8 -*- 
import string
print("Valid letters in this computer are: "+string.letters)
import re 
def validateEmail(a): 
    sep=[x for x in a if not (x.isalpha() or 
                              x.isdigit() or 
                              x in r"!#$%&'*+-/=?^_`{|}~]") ] 
    sepjoined=''.join(sep) 
    ## sep joined must be ..@.... form 
    if len(a)>255 or sepjoined.strip('.') != '@': return False 
    end=a 
    for i in sep: 
        part,i,end=end.partition(i) 
        if len(part)<2: return False 
    return len(end)>1 

def emailval(address): 
    pattern = "[\.\w]{2,}[@]\w+[.]\w+" 
    return re.match(pattern, address)

if __name__ == '__main__': 
    emails = [ "[email protected]","[email protected]", "[email protected]", 
               "[email protected]", "[email protected]","marjaliisa.hämälä[email protected]", 
               "marja-liisa.hämälä[email protected]", "marjaliisah@hel.",'tony@localhost',
               '[email protected]','me@somewhere'] 

    print('\n\t'.join(["Valid emails are:"] + 
                      filter(validateEmail,emails)))

    print('\n\t'.join(["Regexp gives wrong answer:"] + 
                       filter(emailval,emails)))

""" Output:
Valid letters in this computer are: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
Valid emails are:
        [email protected]
        [email protected]
        tony@localhost
        [email protected]
        me@somewhere
Regexp gives wrong answer:
        [email protected]
        [email protected]
        [email protected]
"""

EDIT: cleaned up the regex filter function from this ancient code and edited it into a more permissive version based on @detly's link. It is good enough for me as a first check on form input before sending the confirmation email. Finally added the 255 character length limit check mentioned in the comments.

This code on purpose does not accept the usual a@b as a valid email address, but does accept me@somewhere. Another thing is that the result depends on what isalpha returns. The output above, which is from Ideone.com, does not accept the Scandinavian ö and ä even though they are valid nowadays. When run on my home computer, those are accepted. This happens even with the coding line in place.

Tony Veijalainen
A: 

(Deleted a regular expression which purported to be an "official" one but is in fact not found in the RFC it claimed to be from.)

This regex may be amusing as it is an attempt to precisely match the e-mail address grammar provided in an older version of the Internet mail standards.

kindall
Putting "official" inside quotes is a dead giveaway that it's anything but official :-)
paxdiablo
I went looking for how "official" it was and discovered that you were right. So I substituted a link to an even hairier regex that claims to fulfill most of the RFC 822 standards. :-)
kindall