Freely-available, well-debugged regular expressions

tags:

regex

views:

116

answers:

+2 Q:

Freely-available, well-debugged regular expressions

I was reading ICU documentation and came across this fine advice:

For common tasks like this there are libraries of freely available regular expressions that have been well debugged. It's worth making a quick search before writing a new expression.

To which libraries of well-debugged regular expressions do you commonly refer?

I'm not much taken with http://regexlib.com where the expressions don't seem all that well debugged. It appears to have no QA process besides user comments and ratings.

+1 A:

No - do not use regular expressions to parse emails, even if they have been "well debugged". Chances are they still don't work. Definitely use a library that is designed to parse emails, but stay away from libraries of regular expressions. I've seen one regular expression for emails that claimed to exactly follow the standards and it was several pages long and came with a warning that before applying it you had to first strip comments from the email (which would require a second regular expression).

If you insist on using a regular expression to parse emails then please make it accept invalid addresses rather than rejecting valid addresses.

Mark Byers 2010-04-21 19:55:31

I think you missed the point of the question. Yes, ICU uses an email regex as an example, but I don't think that's what fsb is interested in.

Matthew Flaschen 2010-04-21 20:01:19

@Matthew Flaschen: No, I understand the question. Pretty much all common use cases (parsing email addresses, URLs, HTML, etc.) are best done by dedicated parsing libraries, not regular expressions. Most of these tasks are just too complex for regular expressions. Even when such regexes fill several pages trying to cover all possibilities, they inevitably still miss some cases.

Mark Byers 2010-04-21 20:05:39

Mark, I think valid applications for regex do indeed exist. I don't think this is a radical point of view.

fsb 2010-04-21 20:23:01

If you're using this for email validation but decide to not follow Mark's advice overall, definitely follow his last suggestion and provide a way for someone to 'override' the regex failure. It's quite irritating to have a form tell you the email address you've been using for years is invalid.

Michael Burr 2010-04-21 20:26:44

@fsb: Definitely, but it should be something like: match anything with an "@" in it, and then pass that to a dedicated email parser to check whether or not it is an email. Trying to figure out whether something is or is not an email is not something regular expressions are good at.

Mark Byers 2010-04-21 20:28:23
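
A minimal Python sketch of the two-stage check described in the comment above: a deliberately permissive regex that only asks for an "@" with something on either side, followed by a dedicated parser. The function name `plausible_email` is hypothetical, and using the standard library's `email.headerregistry.Address` as the "dedicated email parser" is an assumption made for illustration; the thread does not name a specific parser. Passing this check says nothing about whether mail to the address is deliverable.

```python
import re
from email.errors import MessageError
from email.headerregistry import Address

# Stage 1: permissive pre-filter. Anything with an "@" and at least one
# character on each side is allowed through (assumption: we prefer to
# accept invalid input here rather than reject valid addresses).
HAS_AT = re.compile(r".+@.+")


def plausible_email(text):
    """Return True if text passes the pre-filter and the parser accepts it."""
    if not HAS_AT.fullmatch(text):
        return False
    try:
        # Stage 2: let a real parser decide whether this is an addr-spec.
        Address(addr_spec=text)
    except (ValueError, MessageError):
        return False
    return True


if __name__ == "__main__":
    for sample in ["user@example.com", "no at sign here", "a@b@c"]:
        print(sample, plausible_email(sample))
```

The regex only screens out input that cannot possibly be an address; the actual decision is left to the parser.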

Mark, you may well be right about the specific case of email addresses. Not so sure that pretty much all common uses of regex are misguided.

fsb 2010-04-21 20:43:31

@fsb: Regex is certainly a valuable tool - no disagreement there. Maybe I'm being too cynical, but my experience so far is that regex libraries are mostly just useful as a reminder list of examples of when not to use regular expressions. However... if someone does link to a very good quality regex library, it'll get my upvote. :-)

Mark Byers 2010-04-21 21:20:27

Mark, I've been writing REs for years and never had much joy reusing those of others. So my curiosity was piqued when the ICU docs suggested there exists a substantial corpus of REs waiting to be tapped. I haven't found it. Hence my original question. The closest thing to an answer so far is RegexBuddy but I don't have Windows.

fsb 2010-04-22 14:18:54

+3 A:

I can't say enough good things about RegexBuddy. It comes with a large library within it. http://www.regexbuddy.com/library.html

It's not free, but if you're on a Windows box it's well worth the investment.

The interactive mode lets you debug your own expressions in real time, and it supports many engines (.NET, Perl, etc.). So it'd let you find that particular leap year bug pretty quickly :).

Nicolas Webb 2010-04-21 19:57:48

The screenshot showing their regex library shows an example of a regex that doesn't work properly. It matches invalid dates. Using a dedicated date parsing library instead would be a more robust solution.

Mark Byers 2010-04-21 20:15:41

@Mark Byers I'm not trying to suggest a regex is the answer to everything. It's good for matching patterns, not parsing. Like what I'm using it for **right now** :).

Nicolas Webb 2010-04-21 20:18:17

@Nicolas: Yeah, at least that one was failing to the safe side. I'm sure there are much worse examples than that one, but I'm not paying the license fee to find out. :)

Mark Byers 2010-04-21 20:32:08

@Mark Byers At the risk of sounding like a shill - they have a 30 day trial :).

Nicolas Webb 2010-04-21 21:48:44

@Nicolas: OK, I'll give it a try... but probably not today.

Mark Byers 2010-04-21 22:11:54

+2 A:

I disagree with Mark.

He is technically right, but whether using a regex is an acceptable risk depends on the exact context you're working in.

Don't let the "good enough" solution be killed because you're trying for perfection.

RichardBlizzard 2010-04-21 20:00:34

+1 A:

If you take the time to learn regular expressions you won't need a library of expressions. I remember consciously deciding to learn regular expressions (years ago, measured in decades, *sigh*) and it has paid off countless times since.

Regular expressions aren't hard. They are just a little mini programming language. If you can write code you can learn regular expressions. One solid day of study should be plenty of time for anyone with a knack for programming.

Then, once you know them you can make an educated decision as to when they are an appropriate solution. Otherwise you're just throwing ideas against a wall in the hopes that one of them sticks. Plus, writing a regular expression from scratch will likely always be quicker and easier than trying to look up a pattern in a library and deciding whether it's good or not.

Bryan Oakley 2010-04-21 20:04:30

No argument that regex is quite easy to learn. And I wouldn't advocate using a stock regex without understanding it. But code reuse is not categorically unreasonable. For example, sometimes it's hard to understand and define the regularities themselves (e.g. the set of all strings that are valid email addresses) while the job of translating those regularities into any given regex language would be straightforward.

fsb 2010-04-21 20:37:17

+4 A:

The problem with regular expression libraries, even those that are well-tested, is that they haven't been tested on your data or for your purposes. A regex that worked fine on somebody else's data for their purposes may not work at all for you.
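
One way to act on this is to run a borrowed pattern against samples drawn from your own data before trusting it. The sketch below is only an illustration: `vet_regex`, the ZIP-code-style pattern, and the sample strings are all hypothetical and do not come from any particular regex library.

```python
import re


def vet_regex(pattern, should_match, should_not_match):
    """Check a candidate pattern against your own samples.

    Returns (missed, false_hits): strings that should have matched but
    did not, and strings that matched but should not have.
    """
    compiled = re.compile(pattern)
    missed = [s for s in should_match if not compiled.fullmatch(s)]
    false_hits = [s for s in should_not_match if compiled.fullmatch(s)]
    return missed, false_hits


# Hypothetical borrowed pattern: written with US ZIP codes in mind.
borrowed = r"\d{5}(-\d{4})?"

# Hypothetical samples from "your" data, which also has a Canadian code.
missed, false_hits = vet_regex(
    borrowed,
    should_match=["90210", "12345-6789", "K1A 0B1"],
    should_not_match=["1234", "123456"],
)
print("missed:", missed)
print("false hits:", false_hits)
```

Here the pattern behaves as its author intended, yet it still misses one of the strings this data set needs to accept, which is the kind of mismatch described above.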

The screen shot at http://www.regexbuddy.com/library.html indeed shows a regex that matches invalid dates such as February 30th. The comment with the regular expression explains this. The comment is not fully visible in the screen shot though.

This is a perfect example of why you have to be careful with regex libraries and copy-and-paste programming in general. The regex \d\d/\d\d/\d\d\d\d may be perfectly acceptable for extracting dates from a file if you know that the file never contains something like 99/99/9999. If a file only contains valid dates and other data that doesn't look like dates at all, then the simple regex is perfectly adequate for extracting the dates. And even if the data can contain invalid dates, you may choose to allow the regex to match them and to filter the invalid dates out in the procedural code that processes the regex matches.
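
The last option, matching loosely and filtering in procedural code, might look like the following Python sketch. The function name `extract_valid_dates` is hypothetical, and the month/day/year field order is an assumption, since the regex itself does not say which field is which.

```python
import re
from datetime import datetime

# The simple pattern from the text: it also matches impossible dates.
DATE_LIKE = re.compile(r"\d\d/\d\d/\d\d\d\d")


def extract_valid_dates(text, fmt="%m/%d/%Y"):
    """Find date-shaped strings, then keep only real calendar dates.

    fmt assumes month/day/year order; pass "%d/%m/%Y" for day-first data.
    """
    dates = []
    for candidate in DATE_LIKE.findall(text):
        try:
            dates.append(datetime.strptime(candidate, fmt).date())
        except ValueError:
            # Date-shaped but not a real date, e.g. 02/30/2010 or 99/99/9999.
            continue
    return dates


if __name__ == "__main__":
    sample = "Paid 04/21/2010, due 02/30/2010, bogus 99/99/9999."
    print(extract_valid_dates(sample))
```

The regex stays short and readable, and the calendar rules (month lengths, leap years) are left to the date library.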

As for email addresses, the only way to determine whether one is valid is to send an email to it and get a response. Even the lack of a bounce message doesn't mean that the email was saved in somebody's mailbox or that it will be read by anyone. A regex can be useful to filter out things that are obviously not email addresses so you can skip the much more expensive step of sending a verification email. A regex can also be useful to extract email addresses from documents or archives. But it indeed can't say whether [email protected] is a valid email address or not. It looks like it is, but it isn't. Email sent to this address is saved to /dev/null.

Jan Goyvaerts 2010-04-22 09:11:47

The idea in the original question – reuse of open-source REs – is not a call to mindless "copy-and-paste programming". I start with the assumption that folk on SO have sufficient competence and rationality to at least think about their computer programs. In my experience this is generally true and I wouldn't use SO otherwise. Someone else may have a better answer, idea, code segment, RE, etc. than I have managed to produce, but this "better" is my context-specific judgement.

fsb 2010-04-22 14:08:53
