Let's say:

/(a|b)/ vs /[ab]/

+17  A: 

There's not much difference in your example (in most languages): both match a single 'a' or 'b'. The major difference is that the () version creates a capturing group that can be backreferenced with \1 inside the pattern (or read back afterwards as $1 in some languages). The [] version doesn't do this.

Also,

/(ab|cd)/  # matches 'ab' or 'cd'
/[abcd]/   # matches 'a', 'b', 'c' or 'd'
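
To make the difference concrete, here is a minimal sketch using Python's `re` module (one Perl-like flavor among many; other engines may differ in details). The sample strings and variable names are purely illustrative.

```python
import re

# (ab|cd) chooses between whole sequences; [abcd] matches exactly one character.
group_match = re.search(r"(ab|cd)", "xxcdxx")
class_match = re.search(r"[abcd]", "xxcdxx")

print(group_match.group(0))  # whole match: cd
print(group_match.group(1))  # captured group 1: cd
print(class_match.group(0))  # whole match: c
print(class_match.groups())  # no capture groups: ()

# A backreference (\1) only works because the parentheses captured something.
doubled = re.search(r"(a|b)\1", "xabba")
print(doubled.group(0))      # the same letter twice in a row: bb
```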
Peter
( ) are also used to denote named groups, for reuse
OMG Ponies
@rexem - yep, thought about this before your comment and edited it :)
Peter
Just me being pedantic =)
OMG Ponies
nope, it's an important point - in fact, this is the main reason I'd use `(a|b)` instead of `[ab]`.
Peter
is (a-z) the same as [a-z]? I never seem to remember ..
lexu
@lexu not at all. `(a-z)` matches literal 'a-z' and creates a group; `[a-z]` matches any (single) lowercase letter (from a-z).
Peter
:-) thought as much .. and is another difference between () and [] .. (question title and body ask something slightly different)
lexu
-1 because that first bit ("there's not much difference") is very misleading. As the "Also" part of the answer makes clear, there is a HUGE difference between () and [] (although it doesn't affect Cheng's original question).
AAT
+4  A: 

First, when speaking about regexes, it's often important to specify what sort of regexes you're talking about. There are several variations (such as the traditional POSIX regexes, Perl and Perl-compatible regexes (PCRE), etc.).

Assuming PCRE or something very similar, which is the most common flavor these days, there are three key differences:

  1. Using parenthetical groups, you can check options consisting of more than one character. So /(a|b)/ might instead be /(abc|defg)/.
  2. Parenthetical groups perform a capture operation so that you can extract the result (so that if it matched on "b", you can get "b" back and see that). /[ab]/ does not. The capture can be suppressed by adding ?: like so: /(?:a|b)/
  3. Even if you suppress the capture behavior of parentheses, the underlying implementation may still be faster for [] when you're checking single characters. Nothing says a non-capturing (?:a|b) can't be optimized as a special case into [ab], although regex compilation may then take ever so slightly longer.
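
The first two points can be sketched in Python's `re` module, which is Perl-like but not actual PCRE, so treat this as an illustration rather than a statement about every engine. The variable names are made up for the example, and no claim about speed is tested here.

```python
import re

capturing = re.compile(r"(abc|defg)")    # multi-character options, captured
non_capturing = re.compile(r"(?:a|b)")   # grouping only, nothing captured
char_class = re.compile(r"[ab]")         # single characters, nothing captured

# The captured text can be read back after a match.
print(capturing.search("xxdefgxx").group(1))  # defg

# Pattern.groups reports how many capturing groups a pattern defines.
print(capturing.groups)      # 1
print(non_capturing.groups)  # 0
print(char_class.groups)     # 0

# The non-capturing form still matches the same text as the character class.
print(non_capturing.search("xbx").group(0))   # b
print(char_class.search("xbx").group(0))      # b
```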
Nicholas Knight
+3  A: 

() in a regular expression are used for grouping, allowing you to apply operators to an entire expression rather than a single character. For instance, if I have the regular expression ab, then ab* refers to an a followed by any number of bs (for instance, a, ab, abb, etc.), while (ab)* refers to any number of repetitions of the sequence ab (for instance, the empty string, ab, abab, etc.). In many regular expression engines, () are also used for creating references that can be referred to after matching. For instance, in Ruby, after you execute "foo" =~ /f(o*)/, $1 will contain oo.
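
The same behavior can be checked in Python as a rough sketch (Python's `re` is used here instead of Ruby only for illustration; `fullmatch` requires the whole string to match, and `capture_match` is just an illustrative name).

```python
import re

# ab* : one 'a' followed by any number of 'b's
print(bool(re.fullmatch(r"ab*", "abbb")))    # True
print(bool(re.fullmatch(r"ab*", "abab")))    # False

# (ab)* : any number of repetitions of the sequence 'ab'
print(bool(re.fullmatch(r"(ab)*", "abab")))  # True
print(bool(re.fullmatch(r"(ab)*", "")))      # True (zero repetitions)
print(bool(re.fullmatch(r"(ab)*", "abb")))   # False

# The parentheses also capture: group(1) plays the role of Ruby's $1.
capture_match = re.search(r"f(o*)", "foo")
print(capture_match.group(1))                # oo
```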

| in a regular expression indicates alternation; it means the expression before the bar, or the expression after it. You could match any digit with the expression 0|1|2|3|4|5|6|7|8|9. You will frequently see alternation wrapped in a set of parentheses for the purposes of grouping or capturing a sub-expression, but it is not required. You can use alternation on longer expressions as well, like foo|bar, to indicate either foo or bar.

You can express every regular expression (in the formal, theoretical sense, not the extended sense that many languages use) with just alternation |, Kleene closure *, concatenation (just writing two expressions next to each other with nothing in between), and parentheses for grouping. But that would be rather inconvenient for complicated expressions, so several shorthands are commonly available. For instance, x? is just a shorthand for |x (that is, the empty string or x), while y+ is a shorthand for yy*.
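
One way to convince yourself of these two shorthands is a brute-force check over all short strings. This is only a sketch: `same_language` is a hypothetical helper written for this example, it tests strings up to a small length rather than proving equivalence, and it assumes an engine (like Python's `re`) that accepts an empty alternative.

```python
import re
from itertools import product

def same_language(pattern_a, pattern_b, alphabet="xy", max_len=4):
    """Return True if both patterns accept exactly the same strings
    among all strings over `alphabet` of length 0..max_len."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            a_ok = re.fullmatch(pattern_a, candidate) is not None
            b_ok = re.fullmatch(pattern_b, candidate) is not None
            if a_ok != b_ok:
                return False
    return True

print(same_language(r"x?", r"(?:|x)"))  # True: empty string or x
print(same_language(r"y+", r"yy*"))     # True: one y, then any number more
print(same_language(r"y+", r"y*"))      # False: y* also accepts the empty string
```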

[] are basically a shorthand for the alternation | of all of the characters, or ranges of characters, within them. As I said, I could write 0|1|2|3|4|5|6|7|8|9, but it's much more convenient to write [0-9]. I can also write [a-zA-Z] to represent any letter. Note that while [] do provide grouping, they do not generally introduce a new reference that can be referred to later on; you would have to wrap them in parentheses for that, like ([a-zA-Z]).
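
As a small sketch in Python (names are illustrative), the character class and the spelled-out alternation agree on every single character tried, and only the parenthesized class produces a reference:

```python
import re

digit_class = re.compile(r"[0-9]")
digit_alternation = re.compile(r"0|1|2|3|4|5|6|7|8|9")

# Both forms accept the same single characters (digits) and reject the rest.
agreement = all(
    (digit_class.fullmatch(ch) is not None)
    == (digit_alternation.fullmatch(ch) is not None)
    for ch in "0123456789ax-"
)
print(agreement)  # True

# A bare character class creates no group; wrapping it in () does.
bare = re.search(r"[a-zA-Z]", "42x")
wrapped = re.search(r"([a-zA-Z])", "42x")
print(bare.groups())     # ()
print(wrapped.group(1))  # x
```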

So, your two example regular expressions are equivalent in what they match, but the (a|b) will set the first sub-match to the matching character, while [ab] will not create any references to sub-matches.

Brian Campbell