I already know the basics of regex, but I'm not sure where to go from here. I'm looking for a good and, above all, easy-to-understand guide, but also for things to use regexes for. It's all well and good reading about them, but if you never use them they won't stick in your mind.

I have already found regular-expressions.info but I'm sure there are more.

+8  A: 

The best resources I have found are:

But my advice is, get stuck in!

GateKiller
+42  A: 

As with everything, the best way to learn is by doing. Install the trial of Regex Buddy and start hacking away.

The awesomeness of Regex Buddy is that it parses your regex and presents in plain English what it is that each of your symbols and groups are doing.

See here for what it can do: http://www.regexbuddy.com/screen.html

Ishmaeel
There are a number of great books mentioned below, but with something complicated like regular expressions, working with them is the only real way to learn.
Drew Stephens
great app really, too bad there is no trial version available :(
Karim
Their other app - Regex Magic has a trial version. They also have a money-back guarantee. Having watched the demos for both apps I'm scrambling for my credit card. They do a package deal.
CAD bloke
+15  A: 

Mastering Regular Expressions, it's in the recommended reading list of both Steve Yegge and Jeff Atwood.

mreggen
+7  A: 

I'd recommend this book -

http://oreilly.com/catalog/9780596528126/

It also covers how different expression engines behave too which is quite important if you're working across different language implementations.

Kev
+1  A: 

txt2re is an excellent online regex builder - paste in a string and click to select which sections should match and it will build the expression for you in several languages.

To see the types of things people are using regex for, check out the Regular Expression Library.

If you're looking for a quick project to help work through building your own expressions, a page scraper would be a good idea. Regex is a great way to do it, and you probably won't be able to cheat by using someone else's expression.

palmsey
+3  A: 

As of about 4 months ago, I'd never used regex for anything more complicated than /[0-9]/

I read through regular-expressions.info 2 or 3 times until I felt like I really understood it, and then started applying Regex anywhere I could reasonably use it - even where it wasn't the best solution, just as a matter of practice.

I picked up RegexBuddy about a month ago, and in that time my regex abilities have probably doubled. It's a great tool, and it makes your life so much easier with live highlighting, explanations and testing. You can also copy the explanation to your clipboard and paste it into your code as a comment, if you're into that sort of thing.

mabwi
A: 

Forget the big books! This book is short, direct, cheap, and doesn't patronize you like the Dummies books do.

http://www.amazon.com/Teach-Yourself-Regular-Expressions-Minutes/dp/0672325667/ref=sr_1_3?ie=UTF8&s=books&qid=1218125461&sr=1-3

rp

+1  A: 

I use Rubular whenever I compose regular expressions. You can test your regex against strings and see what matches (including parentheses capturing). It also has a concise cheat sheet at the bottom of the page.

For in-depth info though, Mastering Regular Expressions can't be beat.

Neall
+2  A: 

I also liked the articles about how regexes actually work:

How Regexes Work

though this is more about how the insides of a regex machine actually work. I found it quite useful though.

There are a few very good regular expression articles in the Perl Journal too, though I have not had much luck finding them online, and mainly use the O'Reilly "Best of" series.

kaybenleroll
+10  A: 

The Regex Coach is another great regex tool which is free and made with lisp. :)

  • It tries to describe the regular expression in plain English
  • It can show a graphical representation of the regular expression's parse tree.
binOr
Windows only though, for sad, though old versions work fine on linux.
Gregg Lind
Thank you, great tool!
aaandre
+1  A: 

Espresso

Reference

I've been hacking away with Espresso and using this syntax guide recently to teach myself a bit more regex, and it's worked quite well for me. I chose it over Regex Buddy because it was free.

Galbrezu
A: 

For an online and good regexp test/build app you can also check RegExr. Quite good.

ila
A: 

"Regexp Syntax Summary" has a concise chart detailing the different regex syntaxes for GNU grep, BRE, ERE, Emacs, Perl, Python, and TCL. It also has a section on which tools use which flavour of regex, and notes on grouping, back-references, and more esoteric bits of regexes.

A: 

REGex Tester is a fairly useful tool.

Ross
A: 

There is also a Windows freeware tool called Windows Grep. I haven't used it much but it has a beginner and expert mode. It's available from www.wingrep.com.

JonnyGold
+2  A: 

I am surprised that no one has mentioned BNF, Backus-Naur Form. Every time I hear someone speaking about regular expressions, they sound as if they are talking about something new. IMHO, regular expression aspirants should spend a few hours reading about what BNF and context-free grammars are.

rptony
+5  A: 

I wrote a two-part tutorial titled "Regex for people who should know regex but do not"

Part one

Part two (PDF)

It was the first reasonable-length guide I have ever written, and I am quite happy with it. One thing I didn't make clear, which people have pointed out: try to avoid using regexes wherever possible. Things are almost always more complicated than \w+@\w+.com, and as the saying sort-of goes, "You've solved a problem with regex, now you have two".

The "Practical examples" section of part 2 was really badly titled on my part. I am not recommending using regexes to escape data being put into SQL queries; I was just demonstrating how to use regexes in a way people should understand.
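To make the "more complicated than \w+@\w+.com" point concrete, here is a small sketch, assuming Python's re module. The sample addresses are made up for illustration, and neither pattern is meant as a real e-mail validator.

```python
import re

naive = re.compile(r"\w+@\w+.com")     # the pattern quoted above; '.' is unescaped
escaped = re.compile(r"\w+@\w+\.com")  # '.' now means a literal dot

# Hypothetical sample inputs, made up for illustration.
bogus = "user@exampleXcom"
dotted = "first.last@mail.example.com"

# The unescaped '.' happily matches the 'X', so a non-address gets through.
naive_accepts_bogus = naive.fullmatch(bogus) is not None

# Escaping the dot fixes that case...
escaped_accepts_bogus = escaped.fullmatch(bogus) is not None

# ...but the pattern still rejects a plausible address with extra dots.
escaped_accepts_dotted = escaped.fullmatch(dotted) is not None

print(naive_accepts_bogus, escaped_accepts_bogus, escaped_accepts_dotted)
```

Each fix tends to uncover another case, which is the reason for the advice to avoid regexes where a dedicated parser or library exists.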

Also, http://www.regular-expressions.info/ is a good site which I still use, despite knowing regex fairly well (for example, when searching for non-capturing groups, the site returned an extremely good description of them)

dbr
+1  A: 

Having just spent the last two days implementing some regexes of my own, I can tell you that:

  • http://www.regular-expressions.info is probably the best resource to learn from.
  • The best free software to use is RegexPal. Frustratingly, it has no support for conditionals, but if your regexes are simple it's absolutely fantastic.
  • The best software by far is RegexBuddy, but the current version has no evaluation copy; it's buy it or nothing.

You can still download a slightly older version of RegexBuddy than current, that has a 7 day evaluation. If you need to do some regexes for work that may take less time than this, definitely go download a copy.

krolley
+10  A: 

Perl of course has fantastic regex support, including this gem:

YAPE::Regex::Explain

PS D:\> perl -e "use YAPE::Regex::Explain; print YAPE::Regex::Explain->new(qr/^\w{2,4}$/)->explain;"

The regular expression:

(?-imsx:^\w{2,4}$)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  \w{2,4}                  word characters (a-z, A-Z, 0-9, _)
                           (between 2 and 4 times (matching the most
                           amount possible))
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
PS D:\>
Ed Guiness
I wonder why this was down-voted. ( I up-voted it up to zero )
Brad Gilbert
+1. This is a great tip!
PEZ
This is a great tool! Is there anything like this for Ruby?
aaandre
A: 

Play around with Python or some other programming language that has regexes built in:

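import re

# ab*c: an 'a', zero or more 'b's, then a 'c'
regex = re.compile(r"ab*c")

assert regex.match("ac")
assert regex.match("abc")
assert regex.match("abbc")

# match() anchors at the start of the string but not at the end,
# so the trailing "?" is simply left unmatched
result = regex.match("abbbc?")
assert result
print(dir(result))  # list the attributes and methods of the match object
assert result.end() == 5 and result.start() == 0
assert result.group(0) == "abbbc"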

You may wonder what r"..." means. It's not special syntax for regexes; it marks a 'raw' string, in which backslashes are kept as literal characters instead of starting escape sequences. A backslash still protects a following quote, though, so r"\" is invalid. With raw strings you don't need to double every backslash or use a different escape character for regexes. It's a useful feature to look for in the language of your choice.

Regexes are extremely useful if you can turn your problem efficiently into a string matching problem. If it's simple enough, you won't even need regexes! As an example, to check whether either player has won at tic-tac-toe, you could try something like this:
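As a small sketch of the raw-string point, assuming Python 3 (the sample sentence is made up), the same pattern can be written with and without the r prefix:

```python
import re

# Without the r prefix, every regex backslash has to be doubled.
plain = "\\d+\\.\\d+"
# With the r prefix, the backslashes are written once.
raw = r"\d+\.\d+"

# Both spellings produce exactly the same pattern text.
same_pattern = plain == raw

# The pattern matches a simple decimal number: digits, a literal dot, digits.
match = re.search(raw, "pi is roughly 3.14")
found = match.group(0) if match else None

print(same_pattern, found)
```

The raw form is usually easier to read because what you type is what the regex engine sees.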

# BOARD STATUS:
# OXX
# _XO
# OX_

current_board = "OXX_XOOX_"

def has_won(board, mark):
    win = mark*3
    # rows, then columns, then the two diagonals
    if win in (board[0:3], board[3:6], board[6:9]): return True
    if win in (board[0::3], board[1::3], board[2::3]): return True
    if win in (board[2::2][:3], board[0::4]): return True
    return False

assert has_won(current_board, 'X')
assert not has_won(current_board, 'O')

I think that, with the same algorithms Python uses for regexes, you could also build pattern generators. Python doesn't really support this, but if it did, you could do things like re.generate(r"(A|T|G)+") or re.generate(r"lo+l").

For a long time I did not use regexes much because I didn't know how to write an efficient regex parser of my own. If you are like me, it's a good idea to look into NFAs and DFAs. It's quite interesting how those regexes are parsed into state machines, but the implementation aspect itself is somewhat boring in the end.

Cheery
+1  A: 

If you ever want to find out how a regex works in perl, you could always "use re 'debug';" or "use re 'debugcolor'".

perl -Mre=debug -e'/^\w{2,4}$/'
# use re 'debug';  /^\w{2,4}$/;
Compiling REx "^\w{2,4}$"
synthetic stclass "ANYOF[0-9A-Z_a-z{unicode_all}]".
Final program:
   1: BOL (2)
   2: CURLY {2,4} (5)
   4:   ALNUM (0)
   5: EOL (6)
   6: END (0)
floating ""$ at 2..4 (checking floating) stclass ANYOF[0-9A-Z_a-z{unicode_all}] anchored(BOL) minlen 2 
Freeing REx: "^\w{2,4}$"
Brad Gilbert
A: 

I use regular expressions a lot in Vim, and it has also helped me learn them. You can enable dynamic highlighting of the matched text as you type the expression.

Vim regular expressions are not completely "standard", but I find that every RE implementation has a few quirks to learn anyway.

Zac Thompson
A: 

I'm quite surprised that nobody has yet mentioned the "Regular Expressions Cookbook" by Jan Goyvaerts and Steven Levithan. Not only is it very down-to-earth and full of useful examples, it also explains very well what kinds of tasks are suitable to regexes and what kinds aren't (like parsing non-regular languages like HTML/XML).

It has also been translated into a few other languages:

Tim Pietzcker
+2  A: 

This site http://www.regular-expressions.info/ was a starting point for me

Faisal
+4  A: 

I would look at: http://www.regular-expressions.info/

and the book

Mastering Regular Expressions

Kevin
+1  A: 

Learn by doing!

The easiest way would be to break out Firebug and learn the JavaScript flavor. You can do this by entering calls in the console, like

'foo'.match(/foo/)

Using a site like http://www.regular-expressions.info/ for reference.

Daniel Schaffer
+14  A: 

A good book often works for me, in this case it was Mastering Regular Expressions, 2nd Ed.

Fabian Steeg
$29.69 - Is it worth the price?
Nick Brooks
@Nick: O'Reilly's books are **always** worth it. How much is that? One expensive meal? Definitely worth it.
voyager
@Nick Totally worth the price IMO - both a great introduction on how to approach regular expressions and a great reference, one of the books I keep getting back at.
Fabian Steeg
+1 O'Reilly is my recommendation also.
JYelton
+1, great book!
stereofrog
+1  A: 

I've liked what I've read thus far in the Regular Expressions Cookbook as recommended by Jeff Atwood.

Oren
A: 

If you learn by example, http://regexlib.com/Default.aspx is a great place for a lot of predefined regular expressions.

ANC_Michael
A: 

I learned 99% of my regex knowledge using "learn by doing" on find/replace in vim (your favorite editor probably supports it too). Now, the only thing that trips me up is switching between variants of regular expression languages (i.e. which frikkin' symbols do I quote this time?).

Stephen
+1  A: 

I stumbled across the first in a ten-part series of screencasts that looks promising. I can use regexes, but I'm not nearly as competent with them as I'd like to be.

I learn by example, by people explaining what they are doing as they do it, and this series does just that. It uses RegExr as a learning tool, which is cool.

Tim Post
+1  A: 

Visual Regexp is a tool that was mentioned to me recently on SO. It allows you to see directly what your proposed RE is actually matching and interactively update it. It should make an excellent companion to a tutorial text such as those already listed in other answers.

Donal Fellows
+2  A: 

I've only recently started using regex; earlier I (like you) just thought it was complicated.

One helpful resource is http://regexpal.com/, where you can enter some text and try regular expressions in real time. You can also try them live in great text editors like Sublime Text and Notepad++.

Volmar
Having a simple interactive tool really hammers the book learning home.
uncle brad
A: 

I learned a lot by practicing using online testers such as http://rubular.com

Reading a book or at least some tutorials is necessary, but regexes are mastered by repetition and long headaches.

marcgg
+37  A: 

The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.

If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn how to write and debug your own patterns but also how to understand patterns written by others.

Start simple

Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.

Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.

If you've ever used grep on Unix—even if only to search for ordinary looking strings—you've already been using regular expressions! (The re in grep refers to regular expressions.)

Order from the menu

Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.

The pattern . is special: rather than matching a literal dot only, it matches any character (in many engines, any character except a newline unless a modifier says otherwise). It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].

Think of character classes as menus: pick just one.
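Here is a minimal sketch of the "menu" idea, assuming Python's re module; the answer itself is language-neutral and the sample strings are made up. The last two lines show how Python treats . and newlines, which other engines may handle differently.

```python
import re

# [Nn] picks exactly one character from the menu.
names = [w for w in ["Nick", "nick", "Rick"] if re.fullmatch(r"[Nn]ick", w)]

# A range inside a class: [a-c] is the same menu as a, b, or c.
letters = re.findall(r"[a-c]", "abcdef")

# In Python, '.' does not match a newline unless re.DOTALL is passed.
dot_default = re.fullmatch(r"a.b", "a\nb") is not None
dot_all = re.fullmatch(r"a.b", "a\nb", re.DOTALL) is not None

print(names, letters, dot_default, dot_all)
```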

Helpful shortcuts

Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match non-negative integers: one way to write that is [0-9]+. Digits are a frequent match target, so you could instead use \d+ to match non-negative integers. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).

The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
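The shortcuts can be tried out with a short sketch like this one, assuming Python's re module and made-up sample text. Note that for Python 3 strings \d also accepts non-ASCII digits, so it only coincides with [0-9] on ASCII input such as this.

```python
import re

text = "Order 66 shipped to dock_7 on day 12"

# \d+ and [0-9]+ find the same runs of digits in this ASCII text.
numbers = re.findall(r"\d+", text)
same_numbers = re.findall(r"[0-9]+", text)

# \w+ matches runs of letters, digits, and underscores.
words = re.findall(r"\w+", "dock_7 is open")

# \S+ is the complement of \s: runs of non-whitespace characters.
chunks = re.findall(r"\S+", "  two   chunks ")

print(numbers, words, chunks)
```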

Once is not enough

From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are

  • * (zero or more times)
  • + (one or more times)
  • {n} (exactly n times)
  • {m,n} (at least m times but no more than n times)

Putting some of these blocks together, the pattern [Nn]*ick matches all of

  • ick
  • Nick
  • nick
  • Nnick
  • nNick
  • nnick
  • (and so on)

The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.
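The quantifiers above can be checked with a sketch along these lines, assuming Python's re module; the candidate strings are illustrative only.

```python
import re

# ? makes the 'b' optional.
optional_b = [s for s in ["ac", "abc", "abbc"] if re.fullmatch(r"ab?c", s)]

# * allows zero or more characters from the class before 'ick'.
candidates = ["ick", "Nick", "nNick", "Rick"]
star_hits = [s for s in candidates if re.fullmatch(r"[Nn]*ick", s)]

# {n} and {m,n} put explicit bounds on the repetition.
exact = re.fullmatch(r"a{3}", "aaa") is not None
bounded = [s for s in ["a", "aa", "aaaa", "aaaaa"] if re.fullmatch(r"a{2,4}", s)]

# x* succeeds even when there is no 'x': it matches the empty string.
empty = re.match(r"x*", "abc").group(0)

print(optional_b, star_hits, exact, bounded, repr(empty))
```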

Grouping

A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.

To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
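A small sketch of quantifying a group and of capturing, assuming Python's re module; the page-range example is a made-up illustration of extracting pieces without counting indices.

```python
import re

plus_on_c = re.compile(r"0abc+0")        # + applies to 'c' only
plus_on_group = re.compile(r"0(abc)+0")  # + applies to the whole group

c_repeats = plus_on_c.fullmatch("0abccc0") is not None
c_pattern_on_repeated_abc = plus_on_c.fullmatch("0abcabc0") is not None
group_repeats = plus_on_group.fullmatch("0abcabc0") is not None

# Parentheses also capture: each group can be read back by number.
m = re.search(r"(\d+)-(\d+)", "pages 12-34")
first, last = m.group(1), m.group(2)

print(c_repeats, c_pattern_on_repeated_abc, group_repeats, first, last)
```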

Alternation

Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).

For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
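To see why the scope of | matters, here is a sketch assuming Python's re module; the "Hi " prefix is a made-up addition so that the difference between the scoped and unscoped forms becomes visible.

```python
import re

# Without parentheses, | splits the whole pattern: "Hi Nick" OR "nick".
loose = re.compile(r"Hi Nick|nick")
# With parentheses, only the name varies.
scoped = re.compile(r"Hi (Nick|nick)")

loose_matches_bare = loose.fullmatch("nick") is not None
scoped_matches_bare = scoped.fullmatch("nick") is not None
scoped_matches_greeting = scoped.fullmatch("Hi nick") is not None

# a|b|c and [a-c] find the same characters.
class_equiv = re.findall(r"a|b|c", "abcd") == re.findall(r"[a-c]", "abcd")

print(loose_matches_bare, scoped_matches_bare, scoped_matches_greeting, class_equiv)
```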

Escaping

Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
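The escaping rule can be sketched as follows, assuming Python's re module. re.escape is a standard-library helper that adds the backslashes for you; the sample strings are made up.

```python
import re

# Unescaped, \d+ means "one or more digits".
digits = re.search(r"\d+", "abc 123").group(0)

# Escaped, \\d\+ means a literal backslash, a 'd', and a literal plus sign.
literal = re.search(r"\\d\+", r"the text \d+ here").group(0)

# re.escape builds the escaped pattern from the literal text.
escaped = re.escape(r"\d+")

print(digits, literal, escaped)
```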

Greed

Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.

For example, say the input is

"Hello," she said, "How are you?"

You might expect ".+" to match only 'Hello,' and will then be surprised when you see that it matched from 'Hello' all the way through 'you?'.

To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how \((.+?)\), the example from your question, works. It matches the sequence of a literal left-parenthesis, followed by one or more characters, and terminated by a right-parenthesis.

If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.
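Greedy and non-greedy matching on the inputs above can be compared with a sketch like this, assuming Python's re module:

```python
import re

sentence = '"Hello," she said, "How are you?"'

# Greedy: .+ runs to the last quote it can reach.
greedy = re.search(r'".+"', sentence).group(0)

# Non-greedy: .+? stops at the first quote that lets the pattern succeed.
lazy = re.search(r'".+?"', sentence).group(0)

# The parenthesized example: each capture stays inside one pair.
captures = re.findall(r"\((.+?)\)", "(123) (456)")

# The greedy version of the same pattern swallows both pairs.
greedy_capture = re.search(r"\((.+)\)", "(123) (456)").group(1)

print(greedy, lazy, captures, greedy_capture)
```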

(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)

Anchors

Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.

Say you want to match comments of the form

-- This is a comment --

you'd write ^--\s+(.+)\s+--$.
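A sketch of the comment pattern in use, assuming Python's re module. In Python, ^ and $ refer to the whole string by default and to individual lines when re.MULTILINE is passed; the sample inputs are made up.

```python
import re

comment = re.compile(r"^--\s+(.+)\s+--$")

# The capture group holds the text between the bookends.
body = comment.search("-- This is a comment --").group(1)

# ^ refuses to match when something precedes the opening dashes.
rejected = comment.search("x -- not a comment --") is None

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
lines = "code\n-- note --\nmore"
whole_string = comment.search(lines) is None
per_line = re.findall(r"^--\s+(.+)\s+--$", lines, re.MULTILINE)

print(body, rejected, whole_string, per_line)
```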

Build your own

Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.

Greg Bacon
Awesome , Thanks!
Nick Brooks
@Nick You're welcome! I hope it helps.
Greg Bacon
`.` will not match "any character" (newlines are not matched) in many languages without a modifier (`s` in most flavors) or at all in other languages (eg ECMAScript). The meanings of `^` and `$` vary depending on context (beginning and end of line where the `s` modifier is not set, or the `m` modifier is; beginning and end of string otherwise; `^` negates a character class when placed at the beginning, and both are literal characters in a character class otherwise).
eyelidlessness
+2  A: 

This chart may help: I learned this from an instructor when he was teaching regular expressions:

             | regex | filename metacharacter expansion (not regex, for comparison)
-------------+-------+-------------------------------------------------------------
Starts With  | ^a    | a*
Ends With    | a$    | *a
Contains     | a     | *a*
Exactly      | ^a$   | a

Then remember these:

^ // Start of line (as the first character inside [], it means NOT instead)
$ // End of line
. // Any 1 character
? // 0 or 1 ( boolean ) of the previous character
* // 0 or many of the previous character
[] // One character from the enclosed set or range
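The chart and symbols above can be tried out with a short sketch, assuming Python's re module; the sample words are made up for illustration.

```python
import re

names = ["apple", "banana", "a", "data"]

# The four rows of the chart, applied to each sample word.
starts_with = [n for n in names if re.search(r"^a", n)]
ends_with = [n for n in names if re.search(r"a$", n)]
contains = [n for n in names if re.search(r"a", n)]
exactly = [n for n in names if re.search(r"^a$", n)]

# . is any one character, ? makes the previous character optional,
# * repeats it zero or more times, [] picks one character from a set.
dot = re.fullmatch(r"c.t", "cut") is not None
optional = [w for w in ["Color", "Colour", "Colouur"] if re.fullmatch(r"Colou?r", w)]
star = [w for w in ["lol", "loool", "ll"] if re.fullmatch(r"lo*l", w)]
negated = re.findall(r"[^aeiou]", "regex")

print(starts_with, ends_with, contains, exactly)
print(dot, optional, star, negated)
```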

Work from left to right and practice, practice, practice

"Colou?r" // Pattern of `Colo` followed by 0 or 1 `u` and an `r`.

"^[A-Za-z0-9]{5}$" // Line with 5 characters of only letters and numbers
" [^aeiou][aeoiu][^aeiou] " // 3-letter word (consonant, vowel, consonant).
Atømix