I understand that

* = "zero or more"
? = "zero or more" ...what's the difference?

Also, ?: << my book uses this, it says it's a "subtlety", but I don't know what exactly it does!

+2  A: 

? = zero or one

You use (?:...) for grouping without saving the matched text in a numbered group, as you would with plain ().

Manu
+1  A: 

? does not mean "zero or more", it means "zero or one".

Ben James
+4  A: 

?: << my book uses this, it says it's a "subtlety", but I don't know what exactly it does!

If that’s indeed what your book says, then I advise getting a better book.

Inside parentheses (more precisely: right after an opening parenthesis), ? has another meaning. It starts a group of options which count only for the scope of the parentheses. ?: is a special case of these options. To understand this special case, you must first know that parentheses create capture groups:

a(.)c

This is a regular expression that matches any three-character string starting with a and ending with c. The middle character is (more or less) arbitrary. Since you put it in parentheses, you can capture it:

import re
m = re.search('a(.)c', 'abcdef')
print(m.group(1))

This will print b, since m.group(1) returns the content of the first parentheses (group(0) returns the whole match, here abc).

Now, consider this regular expression:

a(?:.)c

No capture is made here – this is what ?: after an opening parenthesis means. That is, the following code will fail:

m = re.search('a(?:.)c', 'abcdef')
print(m.group(1))  # raises IndexError: no such group

Because there is no group 1!
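A quick way to check this, sketched with the same sample string as above: groups() returns a tuple of everything that was captured, and asking for a group that does not exist raises IndexError. The variable names here are only for illustration.

```python
import re

capturing = re.search(r'a(.)c', 'abcdef')
non_capturing = re.search(r'a(?:.)c', 'abcdef')

print(capturing.groups())      # ('b',)
print(non_capturing.groups())  # ()
print(non_capturing.group(0))  # abc  (the whole match is still available)

# Asking for group 1 on the non-capturing pattern raises IndexError
try:
    non_capturing.group(1)
    has_group_1 = True
except IndexError:
    has_group_1 = False
print(has_group_1)  # False
```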

Konrad Rudolph
+3  A: 

As Manu already said, ? means "zero or one time". It is the same as {0,1}.
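To see the difference between the quantifiers concretely, here is a minimal sketch using Python's re module. The sample strings are made up for illustration, and re.fullmatch assumes Python 3.4 or later.

```python
import re

samples = ['ac', 'abc', 'abbc']

# b* allows zero or more b's; b? allows at most one; b{0,1} is b? spelled out
star = [s for s in samples if re.fullmatch(r'ab*c', s)]
optional = [s for s in samples if re.fullmatch(r'ab?c', s)]
braces = [s for s in samples if re.fullmatch(r'ab{0,1}c', s)]

print(star)      # ['ac', 'abc', 'abbc']
print(optional)  # ['ac', 'abc']
print(braces)    # ['ac', 'abc']
```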

And by ?:, you probably meant (?:X), where X is some other pattern. This is called a "non-capturing group". Normally, when you wrap parentheses around something, you capture whatever is matched inside those parentheses. For example, the regex .(.).(.) matches any 4 characters (except line breaks) and stores the second character in group 1 and the fourth character in group 2. However, when you write .(?:.).(.), only the fourth character is stored, now in group 1; whatever (?:.) matches is consumed but not "remembered".

A little demo:

import re
m = re.search('.(.).(.)', '1234')
print(m.group(1))
print(m.group(2))
# output:
# 2
# 4

m = re.search('.(?:.).(.)', '1234')
print(m.group(1))
# output:
# 4

You might ask yourself: "why use this non-capturing group at all?". Well, sometimes you want to make an OR between two strings. For example, to match the string "www.google.com" or "www.yahoo.com", you could write www\.google\.com|www\.yahoo\.com, but www\.(google|yahoo)\.com is shorter. And if you're not going to do anything useful with what is captured by this group (the string "google" or "yahoo"), you might as well use a non-capturing group: www\.(?:google|yahoo)\.com. When the regex engine does not need to "remember" the substring "google" or "yahoo", your app/script can run faster. Of course, it won't make much difference with relatively small strings, but when your regex and string(s) get larger, it probably will.
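Speed aside, the choice can change what your code gets back. One case, sketched here with a made-up input string: Python's re.findall returns the group contents when the pattern has a capturing group, and the whole match when it has none.

```python
import re

text = 'www.google.com and www.yahoo.com'

# With a capturing group, findall returns only what the group captured
with_capture = re.findall(r'www\.(google|yahoo)\.com', text)

# With a non-capturing group, findall returns the full matches
without_capture = re.findall(r'www\.(?:google|yahoo)\.com', text)

print(with_capture)     # ['google', 'yahoo']
print(without_capture)  # ['www.google.com', 'www.yahoo.com']
```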

And for a better example of when to use non-capturing groups, see Chris Lutz's comment below.

Bart Kiers
It's not just for running faster. Consider this: `/(\w+)?\s+(\w+)/` There may be one group or two groups, and we don't know which one is which without checking whether or not the second exists. If we know we don't need the first group (basically we're just making sure it exists), we can use `/(?:\w+)?\s+(\w+)/` and then we know the data we want is always in group 1. (Replace `\w+` and `\s+` with more convoluted regexes to get a plausible real-world example.)
Chris Lutz
Excellent point Chris!
Bart Kiers
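A rough Python sketch of the optional-group case from the comment above, with made-up input strings. Note that in Python an optional group that did not take part in the match shows up as None rather than disappearing, so the wanted word sits in group 2 of the capturing pattern and in group 1 of the non-capturing one.

```python
import re

inputs = ['hello world', ' world']

# Capturing version: the unused first group still occupies slot 1
captured = [re.search(r'(\w+)?\s+(\w+)', s).groups() for s in inputs]
print(captured)  # [('hello', 'world'), (None, 'world')]

# Non-capturing version: the wanted word is the only group
wanted = [re.search(r'(?:\w+)?\s+(\w+)', s).group(1) for s in inputs]
print(wanted)  # ['world', 'world']
```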