I understand that
* = "zero or more"
? = "zero or more" ...what's the difference?
Also, ?: << my book uses this, it says it's a "subtlety" but I don't know what exactly these do!
? = zero or one
You use (?:...) for grouping without capturing, that is, without saving what the group matched in a temporary variable as you would with (...).
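To make the difference between ? and * concrete, here is a minimal sketch in Python (the language used elsewhere in this thread). The sample words and the helper name matches are made up for illustration.

```python
import re

# Hypothetical sample words, chosen only for illustration.
words = ['color', 'colour', 'colouur']

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

# 'u?' allows zero or one 'u'.
print(matches(r'colou?r', words))   # ['color', 'colour']

# 'u*' allows zero or more 'u'.
print(matches(r'colou*r', words))   # ['color', 'colour', 'colouur']
```

The two quantifiers only disagree once the repeated character appears more than once, which is why 'colouur' is the only word that tells them apart.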
?: << my book uses this, it says it's a "subtlety" but I don't know what exactly these do!
If that’s indeed what your book says, then I advise getting a better book.
Inside parentheses (more precisely: right after an opening parenthesis), ? has another meaning. It starts a group of options which count only for the scope of the parentheses. ?: is a special case of these options. To understand this special case, you must first know that parentheses create capture groups:
a(.)c
This is a regular expression that matches any three-character string starting with a and ending with c. The middle character is (more or less) arbitrary. Since you put it in parentheses, you can capture it:
import re
m = re.search('a(.)c', 'abcdef')
print(m.group(1))
This will print b, since m.group(1) returns the content of the first pair of parentheses (group(0) returns the whole match, here abc).
Now, consider this regular expression:
a(?:.)c
No capture is made here; this is what ?: after an opening parenthesis means. That is, the following code will fail:
m = re.search('a(?:.)c', 'abcdef')
print(m.group(1))  # raises IndexError: no such group
Because there is no group 1!
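A small sketch of what that failure looks like in practice, assuming Python 3 and the same sample string as above: the overall match is still there, but the match object holds no numbered groups.

```python
import re

m = re.search(r'a(?:.)c', 'abcdef')

print(m.group(0))   # abc  (the whole match is still available)
print(m.groups())   # ()   (no capture groups were created)

try:
    m.group(1)
except IndexError as err:
    # Asking for a group that does not exist raises IndexError.
    print('IndexError:', err)
```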
As Manu already said, ? means "zero or one time". It is the same as {0,1}.
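As a quick check of that equivalence, here is a short sketch that runs both spellings over the same inputs. The sample strings and variable names are made up for illustration.

```python
import re

# Hypothetical inputs, chosen only for illustration.
samples = ['ac', 'abc', 'abbc']

# 'b?' and 'b{0,1}' should accept exactly the same strings.
with_question_mark = [bool(re.fullmatch(r'ab?c', s)) for s in samples]
with_braces = [bool(re.fullmatch(r'ab{0,1}c', s)) for s in samples]

print(with_question_mark)  # [True, True, False]
print(with_braces)         # [True, True, False]
```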
And by ?:, you probably meant (?:X), where X is some other pattern. This is called a "non-capturing group".
Normally, when you wrap parentheses around something, you capture what is matched inside those parentheses. For example, the regex .(.).(.) matches any 4 characters (except line breaks) and stores the second character in group 1 and the fourth character in group 2. However, when you write .(?:.).(.) only the fourth character is stored, in group 1; whatever (?:.) matches is consumed, but not "remembered".
A little demo:
import re
m = re.search('.(.).(.)', '1234')
print(m.group(1))
print(m.group(2))
# output:
# 2
# 4
m = re.search('.(?:.).(.)', '1234')
print(m.group(1))
# output:
# 4
You might ask yourself: "why use this non-capturing group at all?" Well, sometimes you want an OR between two strings. For example, to match the string "www.google.com" or "www.yahoo.com", you could write www\.google\.com|www\.yahoo\.com, but shorter would be www\.(google|yahoo)\.com, of course. But if you're not going to do anything useful with what this group captures (the string "google" or "yahoo"), you might as well use a non-capturing group: www\.(?:google|yahoo)\.com. When the regex engine does not need to "remember" the substring "google" or "yahoo", it has less bookkeeping to do, so your app/script may run slightly faster. It won't make much difference with relatively small strings, but it can matter more as your regex and string(s) get larger.
And for a better example to use non-capturing groups, see Chris Lutz's comment below.
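Apart from speed, one place where the choice visibly changes the result in Python is re.findall, which returns the captured groups when the pattern has any and the whole match otherwise. A minimal sketch, using a made-up sample sentence:

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = 'Visit www.google.com or www.yahoo.com today.'

# With a capturing group, findall returns only what the group captured.
captured = re.findall(r'www\.(google|yahoo)\.com', text)
print(captured)      # ['google', 'yahoo']

# With a non-capturing group, findall returns each whole match.
whole = re.findall(r'www\.(?:google|yahoo)\.com', text)
print(whole)         # ['www.google.com', 'www.yahoo.com']
```

So if you only need the alternation for grouping and want the full hostnames back, the non-capturing form saves you from reassembling them.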