What is the difference between encasing part of a regular expression in () (parentheses) and doing it in [] (square brackets)?
How does this:
[a-z0-9]
differ from this:
(a-z0-9)
?
(…) is a group that groups its contents, like parentheses in math; (a-z0-9) is the grouped sequence of the literal characters a-z0-9. Groups are particularly useful with quantifiers, which allow the preceding expression to be repeated as a whole: a*b* matches any number of a’s followed by any number of b’s, e.g. a, aaab, bbbbb, etc.; in contrast, (ab)* matches any number of ab’s, e.g. ab, abababab, etc.
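The difference between quantifying single characters and quantifying a group can be checked with a small sketch. It assumes Python's `re` module (the question does not name a regex flavor), and the helper `full` and the variable names are made up for illustration.

```python
import re

# a*b*: zero or more "a" followed by zero or more "b"
star_each = re.compile(r"a*b*")
# (ab)*: zero or more repetitions of the whole pair "ab"
star_group = re.compile(r"(ab)*")

def full(pattern, text):
    """Return True when the pattern matches the entire string."""
    return pattern.fullmatch(text) is not None

# Map each sample string to (matches a*b*, matches (ab)*)
results = {
    text: (full(star_each, text), full(star_group, text))
    for text in ["a", "aaab", "bbbbb", "ab", "abababab"]
}

for text, (each, group) in results.items():
    print(f"{text!r}: a*b* -> {each}, (ab)* -> {group}")
```

Only `ab` is accepted by both patterns; `abababab` needs the group, while `aaab` and `bbbbb` need the ungrouped form.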
[…] is a character class that describes the options for one single character; [a-z0-9] describes one single character that can be in the range a–z or 0–9.
The [] construct in a regex is essentially shorthand for an | over all of its contents. For example, [abc] matches a, b or c. Additionally, the - character has a special meaning inside []: it provides a range construct. The regex [a-z] will match any single lowercase letter a through z.
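As a rough illustration of the "shorthand for |" idea, the sketch below compares a character class with the equivalent alternation, again assuming Python's `re` module. The variable names are illustrative only.

```python
import re

bracket = re.compile(r"[abc]")       # character class
alternation = re.compile(r"a|b|c")   # the same choice spelled out with |

# For each candidate character, record whether each pattern accepts it
results = {
    ch: (bool(bracket.fullmatch(ch)), bool(alternation.fullmatch(ch)))
    for ch in "abcd"
}

# Inside [], "-" builds a range: [a-z] accepts one lowercase letter
lower = re.compile(r"[a-z]")
range_hits = [ch for ch in "a-mZ5" if lower.fullmatch(ch)]

print(results)
print(range_hits)
```

Note that the hyphen itself is not accepted by `[a-z]`; it only marks the range.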
The () construct is a grouping construct that establishes a precedence order (it also affects how matched substrings are accessed, but that's a more advanced topic). The regex (abc) will match the string "abc".
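To make the precedence point concrete, here is a small sketch (Python's `re` is an assumption, and the patterns `ab|cd` and `a(b|c)d` are invented for illustration) showing how parentheses limit the reach of `|`.

```python
import re

# Without parentheses, | splits the whole pattern: "ab" or "cd"
ungrouped = re.compile(r"ab|cd")
# With parentheses, | only applies inside the group: "a", then "b" or "c", then "d"
grouped = re.compile(r"a(b|c)d")

ungrouped_abd = ungrouped.fullmatch("abd")   # no match: "abd" is neither "ab" nor "cd"
grouped_abd = grouped.fullmatch("abd")       # match, with "b" in group 1
ungrouped_cd = ungrouped.fullmatch("cd")     # match
grouped_cd = grouped.fullmatch("cd")         # no match: missing the leading "a"

# A plain group such as (abc) matches the literal text and remembers it
plain = re.fullmatch(r"(abc)", "abc")

print(grouped_abd.group(1), plain.group(1))
```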
[] denotes a character class. () denotes a capturing group.
[a-z0-9] -- One character that is in the range a-z OR 0-9
(a-z0-9) -- Explicit capture of the literal text a-z0-9. No ranges.
a -- Can be matched by [a-z0-9].
a-z0-9 -- Can be matched and captured by (a-z0-9), and can then be referenced in a replacement and/or later in the expression.
[a-z0-9] will match any single lowercase letter or digit. (a-z0-9) will match the exact string "a-z0-9" and allows two additional things: you can apply quantifiers like *, ? and + to the whole group, and you can reference the matched text afterwards with $1 or \1. Not useful with your example, though.
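Referencing a captured group might look like the sketch below. It assumes Python's `re` module, where `\1` works both inside the pattern and in the replacement string; the `$1` spelling belongs to other flavors such as JavaScript or Perl. The sample strings are invented.

```python
import re

# \1 inside the pattern: must repeat exactly what group 1 matched
doubled = re.compile(r"([a-z0-9])\1")
hit = doubled.search("hello")    # finds the doubled letter
miss = doubled.search("abc")     # no character is repeated

# \1 and \2 in the replacement: reuse the captured text after the match
swapped = re.sub(r"([a-z]+)-([0-9]+)", r"\2-\1", "abc-123")

print(hit.group(0), miss, swapped)
```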
[a-z0-9] will match one of abcdefghijklmnopqrstuvwxyz0123456789. In other words, square brackets match exactly one character.
(a-z0-9) will match the literal seven-character string a-z0-9, just as if the parentheses weren't there, because - only denotes a range inside square brackets. The () will allow you to read exactly which characters were matched. Parentheses are also useful for OR'ing two expressions with the bar | character. For example, ([a-z]|[0-9]) will match one character: any lowercase letter or digit.
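How a regex engine reads these patterns can be verified directly. The sketch below assumes Python's `re` module; other common flavors treat `-` outside square brackets as a literal character in the same way.

```python
import re

# Outside [], "a-z0-9" is just seven literal characters
literal_ok = re.fullmatch(r"(a-z0-9)", "a-z0-9")
literal_two_chars = re.fullmatch(r"(a-z0-9)", "a5")

# (a-z|0-9) is an alternation of two literal strings, "a-z" or "0-9"
alt_literal = re.fullmatch(r"(a-z|0-9)", "a-z")
alt_single = re.fullmatch(r"(a-z|0-9)", "q")

# To OR two ranges, each range needs its own square brackets
either = re.compile(r"([a-z]|[0-9])")
letter = either.fullmatch("q")
digit = either.fullmatch("7")
both = either.fullmatch("q7")   # two characters, so no full match

print(bool(literal_ok), bool(literal_two_chars), bool(alt_single), bool(letter))
```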
Try ([a-z0-9]+) to match a mixed string of lowercase letters and digits and capture it for back-references (or extraction).
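For extraction, the effect of a quantifier inside the capturing group can be seen in a short sketch. It assumes Python's `re.findall`, and the sample strings are made up.

```python
import re

# One character per capture: the class matches a single character
single = re.findall(r"([a-z0-9])", "ab12")

# With + inside the group, each capture is a whole run of such characters
runs = re.findall(r"([a-z0-9]+)", "ab12 CD ef3")

print(single)
print(runs)
```

The uppercase `CD` is skipped in the second call because the class only covers lowercase letters and digits.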