Let's say:
/(a|b)/
vs /[ab]/
There's not much difference in your above example (in most languages). The major difference is that the () version creates a group that can be backreferenced by \1 in the match (or, sometimes, $1). The [] version doesn't do this.
Also,
/(ab|cd)/ # matches 'ab' or 'cd'
/[abcd]/ # matches 'a', 'b', 'c' or 'd'
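To make the second point concrete, here is a small sketch that runs both patterns against the same inputs. It uses Python's re module as a stand-in for a PCRE-like engine; the variable names and sample strings are made up for illustration.

```python
import re

# Alternation of two-character sequences vs. a four-character class.
alt = re.compile(r"(ab|cd)")
cls = re.compile(r"[abcd]")

# The alternation needs a whole 'ab' or 'cd'.
alt_on_ab = alt.fullmatch("ab")   # a match object
alt_on_a = alt.fullmatch("a")     # None: 'a' alone is neither 'ab' nor 'cd'

# The class consumes exactly one character from the set.
cls_on_a = cls.fullmatch("a")     # a match object
cls_on_ab = cls.fullmatch("ab")   # None: two characters are one too many

print(bool(alt_on_ab), bool(alt_on_a), bool(cls_on_a), bool(cls_on_ab))
```

So the two forms only coincide when every alternative is a single character.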
First, when speaking about regexes, it's often important to specify what sort of regexes you're talking about. There are several variations (such as the traditional POSIX regexes, Perl and Perl-compatible regexes (PCRE), etc.).
Assuming PCRE or something very similar, which is the most common flavor these days, there are three key differences:
() in a regular expression is used for grouping regular expressions, allowing you to apply operators to an entire expression rather than a single character. For instance, if I have the regular expression ab, then ab* refers to an a followed by any number of b characters (for instance, a, ab, abb, etc.), while (ab)* refers to any number of repetitions of the sequence ab (for instance, the empty string, ab, abab, etc.). In many regular expression engines, () are also used for creating references that can be referred to after matching. For instance, in Ruby, after you execute "foo" =~ /f(o*)/, $1 will contain oo.
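The grouping and capturing behaviour described above can be checked with a short sketch. It is written in Python rather than Ruby, on the assumption that the two engines agree on these basic constructs; group(1) plays the role of Ruby's $1, and the variable names are illustrative.

```python
import re

# ab*   : an 'a' followed by zero or more 'b'
star_on_char = re.compile(r"ab*")
# (ab)* : zero or more repetitions of the whole sequence 'ab'
star_on_group = re.compile(r"(ab)*")

# Whole-string matches, so the operator's scope is the only thing being tested.
char_results = [bool(star_on_char.fullmatch(s)) for s in ["a", "ab", "abb", "abab"]]
group_results = [bool(star_on_group.fullmatch(s)) for s in ["", "ab", "abab", "abb"]]

# Capturing: the parenthesised part of the match is available afterwards.
captured = re.search(r"f(o*)", "foo").group(1)

print(char_results)   # [True, True, True, False]
print(group_results)  # [True, True, True, False]
print(captured)       # oo
```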
| in a regular expression indicates alternation; it means the expression before the bar, or the expression after it. You could match any digit with the expression 0|1|2|3|4|5|6|7|8|9. You will frequently see alternation wrapped in a set of parentheses for the purposes of grouping or capturing a sub-expression, but it is not required. You can use alternation on longer expressions as well, like foo|bar, to indicate either foo or bar.
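As a rough illustration of alternation with and without parentheses, the following Python sketch searches a couple of made-up strings. The gray/grey pattern is my own example, not from the question; it is there only to show why alternation is so often wrapped in a group.

```python
import re

# Bare alternation: no parentheses needed when the whole pattern is the choice.
digit = re.compile(r"0|1|2|3|4|5|6|7|8|9")
word = re.compile(r"foo|bar")

digits_found = digit.findall("a1b22c")        # ['1', '2', '2']
words_found = word.findall("foo, baz, bar")   # ['foo', 'bar']

# Alternation binds more loosely than concatenation, so parentheses are what
# limit it to part of a pattern: gr(a|e)y is "gray or grey", while gra|ey is
# "gra or ey".
grouped = bool(re.fullmatch(r"gr(a|e)y", "grey"))   # True
ungrouped = bool(re.fullmatch(r"gra|ey", "grey"))   # False

print(digits_found, words_found, grouped, ungrouped)
```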
You can express every regular expression (in the formal, theoretical sense, not the extended sense that many languages use) with just alternation |, Kleene closure *, concatenation (just writing two expressions next to each other with nothing in between), and parentheses for grouping. But that would be rather inconvenient for complicated expressions, so several shorthands are commonly available. For instance, x? is just a shorthand for |x (that is, the empty string or x), while y+ is a shorthand for yy*.
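These two shorthands can be spot-checked by comparing each pattern with its expansion on a handful of strings. The sketch below does that in Python; same_on_samples is a hypothetical helper written for this example, and agreeing on a few samples is only a sanity check, not a proof of equivalence.

```python
import re

samples = ["", "x", "xx", "y", "yy", "yyy"]

def same_on_samples(p, q, strings):
    """Return True if patterns p and q accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s))
        for s in strings
    )

# x? should behave like "empty string or x".
optional_ok = same_on_samples(r"x?", r"|x", samples)
# y+ should behave like "one y, then zero or more y".
plus_ok = same_on_samples(r"y+", r"yy*", samples)
# A deliberately wrong expansion, to show that the check can fail.
wrong = same_on_samples(r"y+", r"y*", samples)

print(optional_ok, plus_ok, wrong)  # True True False
```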
[] are basically a shorthand for the alternation | of all of the characters, or ranges of characters, within it. As I said, I could write 0|1|2|3|4|5|6|7|8|9, but it's much more convenient to write [0-9]. I can also write [a-zA-Z] to represent any ASCII letter. Note that while [] do provide grouping, they do not generally introduce a new reference that can be referred to later on; you would have to wrap them in parentheses for that, like ([a-zA-Z]).
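A brief Python sketch of both claims: that [0-9] accepts the same single characters as the spelled-out alternation, and that a character class captures nothing unless it is wrapped in parentheses. The test characters and the string "42x" are arbitrary choices for illustration.

```python
import re

digit_class = re.compile(r"[0-9]")
digit_alt = re.compile(r"0|1|2|3|4|5|6|7|8|9")

# Both forms should accept or reject each of these single characters alike.
agree = all(
    bool(digit_class.fullmatch(ch)) == bool(digit_alt.fullmatch(ch))
    for ch in "0123456789aZ-"
)

# groups() lists the captured sub-matches of a match.
no_group = re.search(r"[a-zA-Z]", "42x").groups()       # ()
with_group = re.search(r"([a-zA-Z])", "42x").groups()   # ('x',)

print(agree, no_group, with_group)
```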
So, your two example regular expressions are equivalent in what they match, but the (a|b) will set the first sub-match to the matching character, while [ab] will not create any references to sub-matches.
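For the two patterns from the question, the difference shows up as follows in Python, used here as an assumed stand-in for whichever engine you have. The \1 backreference line is an extra illustration of what the group makes possible.

```python
import re

m_group = re.fullmatch(r"(a|b)", "b")
m_class = re.fullmatch(r"[ab]", "b")

# Both patterns match the same strings...
both_match = m_group is not None and m_class is not None

# ...but only the parenthesised form records a sub-match.
first_submatch = m_group.group(1)            # 'b'
class_submatches = m_class.groups()          # ()
group_count = re.compile(r"(a|b)").groups    # 1
class_count = re.compile(r"[ab]").groups     # 0

# The captured group can be reused inside the pattern via \1:
# (a|b)\1 accepts 'aa' and 'bb' but not 'ab'.
doubled = [bool(re.fullmatch(r"(a|b)\1", s)) for s in ["aa", "bb", "ab"]]

print(both_match, first_submatch, class_submatches, group_count, class_count, doubled)
```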